diff --git a/environment.yml b/environment.yml index b1109c041..dba2f3790 100644 --- a/environment.yml +++ b/environment.yml @@ -14,6 +14,7 @@ dependencies: - ghp-import==1.1.0 - sphinxcontrib-youtube==1.1.0 - sphinx-togglebutton==0.3.1 + - arviz==0.12.1 # Sandpit Requirements - quantecon - array-to-latex diff --git a/lectures/ar1_bayes.md b/lectures/ar1_bayes.md index 53c3d5b58..f49adc3de 100644 --- a/lectures/ar1_bayes.md +++ b/lectures/ar1_bayes.md @@ -11,9 +11,7 @@ kernelspec: name: python3 --- -## Posterior Distributions for AR(1) Parameters - - +# Posterior Distributions for AR(1) Parameters We'll begin with some Python imports. @@ -174,7 +172,7 @@ Now we shall use Bayes' law to construct a posterior distribution, conditioning First we'll use **pymc4**. -### `PyMC` Implementation +## `PyMC` Implementation For a normal distribution in `pymc`, $var = 1/\tau = \sigma^{2}$. @@ -286,7 +284,7 @@ We'll return to this issue after we use `numpyro` to compute posteriors under ou We'll now repeat the calculations using `numpyro`. -### `Numpyro` Implementation +## `Numpyro` Implementation ```{code-cell} ipython3 diff --git a/lectures/ar1_turningpts.md b/lectures/ar1_turningpts.md index c1f0e78b0..34d3eb490 100644 --- a/lectures/ar1_turningpts.md +++ b/lectures/ar1_turningpts.md @@ -13,6 +13,12 @@ kernelspec: # Forecasting an AR(1) process +```{code-cell} ipython3 +:tags: [hide-output] + +!pip install arviz pymc +``` + This lecture describes methods for forecasting statistics that are functions of future values of a univariate autogressive process. The methods are designed to take into account two possible sources of uncertainty about these statistics: @@ -25,7 +31,7 @@ We consider two sorts of statistics: - prospective values $y_{t+j}$ of a random process $\{y_t\}$ that is governed by the AR(1) process -- sample path properties that are defined as non-linear functions of future values $\{y_{t+j}\}_{j\geq 1}$ at time $t$. +- sample path properties that are defined as non-linear functions of future values $\{y_{t+j}\}_{j \geq 1}$ at time $t$. **Sample path properties** are things like "time to next turning point" or "time to next recession" @@ -54,54 +60,48 @@ logger = logging.getLogger('pymc') logger.setLevel(logging.CRITICAL) ``` -## A Univariate First-Order Autoregressive Process - +## A Univariate First-Order Autoregressive Process Consider the univariate AR(1) model: $$ y_{t+1} = \rho y_t + \sigma \epsilon_{t+1}, \quad t \geq 0 -$$ (eq1) +$$ (ar1-tp-eq1) where the scalars $\rho$ and $\sigma$ satisfy $|\rho| < 1$ and $\sigma > 0$; $\{\epsilon_{t+1}\}$ is a sequence of i.i.d. normal random variables with mean $0$ and variance $1$. -The initial condition $y_{0}$ is a known number. +The initial condition $y_{0}$ is a known number. 
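As a minimal illustration of the law of motion {eq}`ar1-tp-eq1`, one can simulate a path directly with NumPy. This is only a sketch with made-up parameter values; the lecture builds its own `AR1_simulate` helper below.

```{code-cell} ipython3
import numpy as np

# illustrative parameter values (assumptions for this sketch, not the lecture's)
rho, sigma, y0, T = 0.9, 1.0, 10.0, 100

rng = np.random.default_rng(0)
y = np.empty(T + 1)
y[0] = y0
for t in range(T):
    # y_{t+1} = rho * y_t + sigma * eps_{t+1},  eps_{t+1} ~ N(0, 1)
    y[t + 1] = rho * y[t] + sigma * rng.standard_normal()

y[:5]
```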
-Equation {eq}`eq1` implies that for $t \geq 0$, the conditional density of $y_{t+1}$ is +Equation {eq}`ar1-tp-eq1` implies that for $t \geq 0$, the conditional density of $y_{t+1}$ is $$ f(y_{t+1} | y_{t}; \rho, \sigma) \sim {\mathcal N}(\rho y_{t}, \sigma^2) \ -$$ (eq2) - - -Further, equation {eq}`eq1` also implies that for $t \geq 0$, the conditional density of $y_{t+j}$ for $j \geq 1$ is +$$ (ar1-tp-eq2) +Further, equation {eq}`ar1-tp-eq1` also implies that for $t \geq 0$, the conditional density of $y_{t+j}$ for $j \geq 1$ is $$ f(y_{t+j} | y_{t}; \rho, \sigma) \sim {\mathcal N}\left(\rho^j y_{t}, \sigma^2 \frac{1 - \rho^{2j}}{1 - \rho^2} \right) -$$ (eq3) +$$ (ar1-tp-eq3) - -The predictive distribution {eq}`eq3` that assumes that the parameters $\rho, \sigma$ are known, which we express +The predictive distribution {eq}`ar1-tp-eq3` that assumes that the parameters $\rho, \sigma$ are known, which we express by conditioning on them. We also want to compute a predictive distribution that does not condition on $\rho,\sigma$ but instead takes account of our uncertainty about them. -We form this predictive distribution by integrating {eq}`eq3` with respect to a joint posterior distribution $\pi_t(\rho,\sigma | y^t )$ +We form this predictive distribution by integrating {eq}`ar1-tp-eq3` with respect to a joint posterior distribution $\pi_t(\rho,\sigma | y^t)$ that conditions on an observed history $y^t = \{y_s\}_{s=0}^t$: $$ f(y_{t+j} | y^t) = \int f(y_{t+j} | y_{t}; \rho, \sigma) \pi_t(\rho,\sigma | y^t ) d \rho d \sigma -$$ (eq4) - - +$$ (ar1-tp-eq4) -Predictive distribution {eq}`eq3` assumes that parameters $(\rho,\sigma)$ are known. +Predictive distribution {eq}`ar1-tp-eq3` assumes that parameters $(\rho,\sigma)$ are known. -Predictive distribution {eq}`eq4` assumes that parameters $(\rho,\sigma)$ are uncertain, but have known probability distribution $\pi_t(\rho,\sigma | y^t )$. +Predictive distribution {eq}`ar1-tp-eq4` assumes that parameters $(\rho,\sigma)$ are uncertain, but have known probability distribution $\pi_t(\rho,\sigma | y^t )$. -We also want to compute some predictive distributions of "sample path statistics" that might include, for example +We also want to compute some predictive distributions of "sample path statistics" that might include, for example - the time until the next "recession", - the minimum value of $Y$ over the next 8 periods, @@ -115,20 +115,16 @@ To accomplish that for situations in which we are uncertain about parameter valu - for each draw $n=0,1,...,N$, simulate a "future path" of length $T_1$ with parameters $\left(\rho_n,\sigma_n\right)$ and compute our three "sample path statistics"; - finally, plot the desired statistics from the $N$ samples as an empirical distribution. - - ## Implementation First, we'll simulate a sample path from which to launch our forecasts. In addition to plotting the sample path, under the assumption that the true parameter values are known, we'll plot $.9$ and $.95$ coverage intervals using conditional distribution -{eq}`eq3` described above. +{eq}`ar1-tp-eq3` described above. We'll also plot a bunch of samples of sequences of future values and watch where they fall relative to the coverage interval. - - ```{code-cell} ipython3 def AR1_simulate(rho, sigma, y0, T): @@ -198,21 +194,21 @@ Wecker {cite}`wecker1979predicting` proposed using simulation techniques to char He called these functions "path properties" to contrast them with properties of single data points. 
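As a quick numerical sanity check of the closed-form $j$-step-ahead conditional distribution {eq}`ar1-tp-eq3` used above, the following sketch simulates many $j$-step-ahead draws from a fixed $y_t$ and compares their sample mean and variance with the formula. The parameter values are illustrative assumptions only.

```{code-cell} ipython3
import numpy as np

# illustrative parameters (assumptions for this sketch only)
rho, sigma, y_t, j = 0.9, 1.0, 2.0, 8
N = 200_000

rng = np.random.default_rng(1)
y = np.full(N, y_t)
for _ in range(j):
    y = rho * y + sigma * rng.standard_normal(N)

# closed-form mean and variance from {eq}`ar1-tp-eq3`
mean_theory = rho**j * y_t
var_theory = sigma**2 * (1 - rho**(2 * j)) / (1 - rho**2)

print(f"simulated mean {y.mean():.3f} vs theory {mean_theory:.3f}")
print(f"simulated var  {y.var():.3f} vs theory {var_theory:.3f}")
```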
-He studied two special prospective path properties of a given series $\{y_t\}$. +He studied two special prospective path properties of a given series $\{y_t\}$. -The first was **time until the next turning point** +The first was **time until the next turning point** - * he defined a **"turning point"** to be the date of the second of two successive declines in $y$. +* he defined a **"turning point"** to be the date of the second of two successive declines in $y$. -To examine this statistic, let $Z$ be an indicator process +To examine this statistic, let $Z$ be an indicator process -$$ + Then the random variable **time until the next turning point** is defined as the following **stopping time** with respect to $Z$: @@ -220,8 +216,8 @@ $$ W_t(\omega):= \inf \{ k\geq 1 \mid Z_{t+k}(\omega) = 1\} $$ -Wecker {cite}`wecker1979predicting` also studied **the minimum value of $Y$ over the next 8 quarters** -which can be defined as the random variable +Wecker {cite}`wecker1979predicting` also studied **the minimum value of $Y$ over the next 8 quarters** +which can be defined as the random variable $$ M_t(\omega) := \min \{ Y_{t+1}(\omega); Y_{t+2}(\omega); \dots; Y_{t+8}(\omega)\} @@ -230,36 +226,34 @@ $$ It is interesting to study yet another possible concept of a **turning point**. Thus, let - + Define a **positive turning point today or tomorrow** statistic as -$$ + This is designed to express the event - - ``after one or two decrease(s), $Y$ will grow for two consecutive quarters'' - +- ``after one or two decrease(s), $Y$ will grow for two consecutive quarters'' Following {cite}`wecker1979predicting`, we can use simulations to calculate probabilities of $P_t$ and $N_t$ for each period $t$. ## A Wecker-Like Algorithm - The procedure consists of the following steps: * index a sample path by $\omega_i$ @@ -270,11 +264,9 @@ $$ Y(\omega_i) = \left\{ Y_{t+1}(\omega_i), Y_{t+2}(\omega_i), \dots, Y_{t+N}(\omega_i)\right\}_{i=1}^I $$ -* for each path $\omega_i$, compute the associated value of $W_t(\omega_i), W_{t+1}(\omega_i), \dots$ +* for each path $\omega_i$, compute the associated value of $W_t(\omega_i), W_{t+1}(\omega_i), \dots$ -* consider the sets $ -\{W_t(\omega_i)\}^{T}_{i=1}, \ \{W_{t+1}(\omega_i)\}^{T}_{i=1}, \ \dots, \ \{W_{t+N}(\omega_i)\}^{T}_{i=1} -$ as samples from the predictive distributions $f(W_{t+1} \mid \mathcal y_t, \dots)$, $f(W_{t+2} \mid y_t, y_{t-1}, \dots)$, $\dots$, $f(W_{t+N} \mid y_t, y_{t-1}, \dots)$. +* consider the sets $\{W_t(\omega_i)\}^{T}_{i=1}, \ \{W_{t+1}(\omega_i)\}^{T}_{i=1}, \ \dots, \ \{W_{t+N}(\omega_i)\}^{T}_{i=1}$ as samples from the predictive distributions $f(W_{t+1} \mid \mathcal y_t, \dots)$, $f(W_{t+2} \mid y_t, y_{t-1}, \dots)$, $\dots$, $f(W_{t+N} \mid y_t, y_{t-1}, \dots)$. ## Using Simulations to Approximate a Posterior Distribution @@ -283,7 +275,6 @@ The next code cells use `pymc` to compute the time $t$ posterior distribution of Note that in defining the likelihood function, we choose to condition on the initial value $y_0$. - ```{code-cell} ipython3 def draw_from_posterior(sample): """ @@ -328,7 +319,6 @@ The graphs on the left portray posterior marginal distributions. ## Calculating Sample Path Statistics - Our next step is to prepare Python codeto compute our sample path statistics. 
```{code-cell} ipython3 @@ -455,9 +445,9 @@ plt.show() ## Extended Wecker Method Now we apply we apply our "extended" Wecker method based on predictive densities of $y$ defined by -{eq}`eq4` that acknowledge posterior uncertainty in the parameters $\rho, \sigma$. +{eq}`ar1-tp-eq4` that acknowledge posterior uncertainty in the parameters $\rho, \sigma$. -To approximate the intergration on the right side of {eq}`eq4`, we repeately draw parameters from the joint posterior distribution each time we simulate a sequence of future values from model {eq}`eq1`. +To approximate the intergration on the right side of {eq}`ar1-tp-eq4`, we repeately draw parameters from the joint posterior distribution each time we simulate a sequence of future values from model {eq}`ar1-tp-eq1`. ```{code-cell} ipython3 def plot_extended_Wecker(post_samples, initial_path, N, ax): @@ -525,4 +515,3 @@ plot_extended_Wecker(post_samples, initial_path, 1000, ax) plt.legend() plt.show() ``` - diff --git a/lectures/bayes_nonconj.md b/lectures/bayes_nonconj.md index f92c1ff07..2d17b6f90 100644 --- a/lectures/bayes_nonconj.md +++ b/lectures/bayes_nonconj.md @@ -888,7 +888,7 @@ SVI_num_steps = 5000 true_theta = 0.8 ``` -#### Beta Prior and Posteriors +### Beta Prior and Posteriors: Let's compare outcomes when we use a Beta prior. @@ -944,7 +944,7 @@ Here the MCMC approximation looks good. But the VI approximation doesn't look so good. - * even though we use the beta distribution as our guide, the VI approximated posterior distributions do not closely resemble the posteriors that we had just computed analytically. +* even though we use the beta distribution as our guide, the VI approximated posterior distributions do not closely resemble the posteriors that we had just computed analytically. (Here, our initial parameter for Beta guide is (0.5, 0.5).) @@ -960,8 +960,6 @@ BayesianInferencePlot(true_theta, num_list, BETA_numpyro).SVI_plot(guide_dist='b ``` - - ## Non-conjugate Prior Distributions Having assured ourselves that our MCMC and VI methods can work well when we have conjugate prior and so can also compute analytically, we diff --git a/lectures/lagrangian_lqdp.md b/lectures/lagrangian_lqdp.md index 1ab5e3442..1ce8eba87 100644 --- a/lectures/lagrangian_lqdp.md +++ b/lectures/lagrangian_lqdp.md @@ -160,7 +160,7 @@ For the undiscounted optimal linear regulator problem, form the Lagrangian $$ {\cal L} = - \sum^\infty_{t=0} \biggl\{ x^\prime_t R x_t + u_t^\prime Q u_t + 2 \mu^\prime_{t+1} [A x_t + B u_t - x_{t+1}]\biggr\} -$$ (eq1) +$$ (lag-lqdp-eq1) where $2 \mu_{t+1}$ is a vector of Lagrange multipliers on the time $t$ transition law $x_{t+1} = A x_t + B u_t$. @@ -172,16 +172,16 @@ $$ \begin{aligned} 2 Q u_t &+ 2B^\prime \mu_{t+1} = 0 \ ,\ t \geq 0 \cr \mu_t &= R x_t + A^\prime \mu_{t+1}\ ,\ t\geq 1.\cr \end{aligned} -$$ (eq2) +$$ (lag-lqdp-eq2) -Define $\mu_0$ to be a vector of shadow prices of $x_0$ and apply an envelope condition to {eq}`eq1` +Define $\mu_0$ to be a vector of shadow prices of $x_0$ and apply an envelope condition to {eq}`lag-lqdp-eq1` to deduce that $$ \mu_0 = R x_0 + A' \mu_1, $$ -which is a time $t=0 $ counterpart to the second equation of system {eq}`eq2`. +which is a time $t=0 $ counterpart to the second equation of system {eq}`lag-lqdp-eq2`. An important fact is that @@ -199,11 +199,11 @@ corresponds to the **state** vector $x_t$. It is useful to proceed with the following steps: -* solve the first equation of {eq}`eq2` for $u_t$ in terms of $\mu_{t+1}$. 
+* solve the first equation of {eq}`lag-lqdp-eq2` for $u_t$ in terms of $\mu_{t+1}$. * substitute the result into the law of motion $x_{t+1} = A x_t + B u_t$. -* arrange the resulting equation and the second equation of {eq}`eq2` into the form +* arrange the resulting equation and the second equation of {eq}`lag-lqdp-eq2` into the form $$ L\ \begin{pmatrix}x_{t+1}\cr \mu_{t+1}\cr\end{pmatrix}\ = \ N\ \begin{pmatrix}x_t\cr \mu_t\cr\end{pmatrix}\ @@ -271,7 +271,7 @@ The rank of $J$ is $2n$. $$ MJM^\prime = J. -$$ (eq3) +$$ (lag-lqdp-eq3) Salient properties of symplectic matrices that are readily verified include: @@ -280,14 +280,14 @@ Salient properties of symplectic matrices that are readily verified include: It can be verified directly that $M$ in equation {eq}`Mdefn` is symplectic. -It follows from equation {eq}`eq3` and from the fact $J^{-1} = J^\prime = -J$ that for any symplectic +It follows from equation {eq}`lag-lqdp-eq3` and from the fact $J^{-1} = J^\prime = -J$ that for any symplectic matrix $M$, $$ M^\prime = J^{-1} M^{-1} J. -$$ (eq4) +$$ (lag-lqdp-eq4) -Equation {eq}`eq4` states that $M^\prime$ is related to the inverse of $M$ +Equation {eq}`lag-lqdp-eq4` states that $M^\prime$ is related to the inverse of $M$ by a **similarity transformation**. For square matrices, recall that @@ -298,7 +298,7 @@ For square matrices, recall that * a matrix and its transpose share eigenvalues -It then follows from equation {eq}`eq4` that +It then follows from equation {eq}`lag-lqdp-eq4` that the eigenvalues of $M$ occur in reciprocal pairs: if $\lambda$ is an eigenvalue of $M$, so is $\lambda^{-1}$. @@ -809,7 +809,7 @@ $$ which is a time $t=0 $ counterpart to the second equation of system {eq}`eq662`. -Proceeding as we did above with the undiscounted system {eq}`eq2`, we can rearrange the first-order conditions into the +Proceeding as we did above with the undiscounted system {eq}`lag-lqdp-eq2`, we can rearrange the first-order conditions into the system $$ @@ -821,7 +821,7 @@ $$ \left[\begin{matrix} x_t \cr \mu_t \end{matrix}\right] $$ (eq663) -which in the special case that $\beta = 1$ agrees with equation {eq}`eq2`, as expected. +which in the special case that $\beta = 1$ agrees with equation {eq}`lag-lqdp-eq2`, as expected. 
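The three facts just used — the defining property {eq}`lag-lqdp-eq3`, the similarity relation {eq}`lag-lqdp-eq4`, and the reciprocal pairing of eigenvalues — are easy to verify numerically. The sketch below uses a generic random symplectic matrix built as the exponential of a Hamiltonian matrix rather than the lecture's particular $M$ from {eq}`Mdefn`, so it is an illustration of the general claims, not of the LQ construction itself.

```{code-cell} ipython3
import numpy as np
from scipy.linalg import expm

n = 3
rng = np.random.default_rng(2)

# J = [[0, I], [-I, 0]]
I_n = np.eye(n)
J = np.block([[np.zeros((n, n)), I_n],
              [-I_n, np.zeros((n, n))]])

# a generic symplectic matrix: M = expm(J @ S) for symmetric S,
# since J @ S is then a Hamiltonian matrix
S = rng.standard_normal((2 * n, 2 * n))
S = 0.5 * (S + S.T)
M = expm(J @ S)

# M J M' = J  (the defining property of a symplectic matrix)
print(np.allclose(M @ J @ M.T, J))

# M' = J^{-1} M^{-1} J  (the similarity transformation)
print(np.allclose(M.T, np.linalg.inv(J) @ np.linalg.inv(M) @ J))

# eigenvalues occur in reciprocal pairs, so the multiset of moduli
# is unchanged by taking reciprocals
eigs = np.linalg.eigvals(M)
print(np.allclose(np.sort(np.abs(eigs)), np.sort(1 / np.abs(eigs))))
```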
+++ diff --git a/lectures/prob_matrix.md b/lectures/prob_matrix.md index f2fdaa784..4a511b657 100644 --- a/lectures/prob_matrix.md +++ b/lectures/prob_matrix.md @@ -19,17 +19,17 @@ After providing somewhat informal definitions of the underlying objects, we'll u Among concepts that we'll be studying include - - a joint probability distribution - - marginal distributions associated with a given joint distribution - - conditional probability distributions - - statistical independence of two random variables - - joint distributions associated with a prescribed set of marginal distributions - - couplings - - copulas - - the probability distribution of a sum of two independent random variables - - convolution of marginal distributions - - parameters that define a probability distribution - - sufficient statistics as data summaries +- a joint probability distribution +- marginal distributions associated with a given joint distribution +- conditional probability distributions +- statistical independence of two random variables +- joint distributions associated with a prescribed set of marginal distributions + - couplings + - copulas +- the probability distribution of a sum of two independent random variables + - convolution of marginal distributions +- parameters that define a probability distribution +- sufficient statistics as data summaries We'll use a matrix to represent a bivariate probability distribution and a vector to represent a univariate probability distribution @@ -57,13 +57,8 @@ We'll briefly define what we mean by a **probability space**, a **probability me For most of this lecture, we sweep these objects into the background, but they are there underlying the other objects that we'll mainly focus on. - - - Let $\Omega$ be a set of possible underlying outcomes and let $\omega \in \Omega$ be a particular underlying outcomes. - - Let $\mathcal{G} \subset \Omega$ be a subset of $\Omega$. Let $\mathcal{F}$ be a collection of such subsets $\mathcal{G} \subset \Omega$. @@ -72,7 +67,7 @@ The pair $\Omega,\mathcal{F}$ forms our **probability space** on which we want A **probability measure** $\mu$ maps a set of possible underlying outcomes $\mathcal{G} \in \mathcal{F}$ into a scalar number between $0$ and $1$ - - this is the "probability" that $X$ belongs to $A$, denoted by $ \textrm{Prob}\{X\in A\}$. +- this is the "probability" that $X$ belongs to $A$, denoted by $ \textrm{Prob}\{X\in A\}$. A **random variable** $X(\omega)$ is a function of the underlying outcome $\omega \in \Omega$. @@ -89,8 +84,6 @@ where ${\mathcal G}$ is the subset of $\Omega$ for which $X(\omega) \in A$. We call this the induced probability distribution of random variable $X$. - - ## Digression: What Does Probability Mean? Before diving in, we'll say a few words about what probability theory means and how it connects to statistics. @@ -103,7 +96,6 @@ These are purely mathematical objects. To appreciate how statisticians connect probabilities to data, the key is to understand the following concepts: - * A single draw from a probability distribution * Repeated independently and identically distributed (i.i.d.) draws of "samples" or "realizations" from the same probability distribution * A **statistic** defined as a function of a sequence of samples @@ -111,7 +103,7 @@ To appreciate how statisticians connect probabilities to data, the key is to und * The idea that a population probability distribution is what we anticipate **relative frequencies** will be in a long sequence of i.i.d. draws. 
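As a concrete preview of the last bullet, here is a minimal simulation (with made-up probabilities) in which the relative frequencies of i.i.d. draws approach the population probabilities as the number of draws grows; the formal statements that make this precise follow.

```{code-cell} ipython3
import numpy as np

# a small "population" distribution over {0, 1, 2} (illustrative numbers)
f = np.array([0.2, 0.5, 0.3])
rng = np.random.default_rng(3)

for N in [100, 10_000, 1_000_000]:
    draws = rng.choice(len(f), size=N, p=f)
    freq = np.bincount(draws, minlength=len(f)) / N
    print(N, freq)   # relative frequencies approach f as N grows
```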
Here the following mathematical machinery makes precise what is meant by **anticipated relative frequencies** - **Law of Large Numbers (LLN)** - **Central Limit Theorem (CLT)** -+++ + **Scalar example** @@ -136,41 +128,38 @@ $$ \end{aligned} $$ - - Consider the **empirical distribution**: -\begin{align} +$$ +\begin{aligned} i & = 0,\dots,I-1,\\ N_i & = \text{number of times} \ X = i,\\ N & = \sum^{I-1}_{i=0} N_i \quad \text{total number of draws},\\ \tilde {f_i} & = \frac{N_i}{N} \sim \ \text{frequency of draws for which}\ X=i -\end{align} +\end{aligned} +$$ Key ideas that justify connecting probability theory with statistics are laws of large numbers and central limit theorems **LLN:** - - A Law of Large Numbers (LLN) states that $\tilde {f_i} \to f_i \text{ as } N \to \infty$ +- A Law of Large Numbers (LLN) states that $\tilde {f_i} \to f_i \text{ as } N \to \infty$ **CLT:** - - A Central Limit Theorem (CLT) describes a **rate** at which $\tilde {f_i} \to f_i$ +- A Central Limit Theorem (CLT) describes a **rate** at which $\tilde {f_i} \to f_i$ **Remarks** - - For "frequentist" statisticians, **anticipated relative frequency** is **all** that a probability distribution means. - - - But for a Bayesian it means something more or different. +- For "frequentist" statisticians, **anticipated relative frequency** is **all** that a probability distribution means. +- But for a Bayesian it means something more or different. ## Representing Probability Distributions - - A probability distribution $\textrm{Prob} (X \in A)$ can be described by its **cumulative distribution function (CDF)** $$ @@ -194,9 +183,9 @@ When a probability density exists, a probability distribution can be characteriz For a **discrete-valued** random variable - * the number of possible values of $X$ is finite or countably infinite - * we replace a **density** with a **probability mass function**, a non-negative sequence that sums to one - * we replace integration with summation in the formula like {eq}`eq:CDFfromdensity` that relates a CDF to a probability mass function +* the number of possible values of $X$ is finite or countably infinite +* we replace a **density** with a **probability mass function**, a non-negative sequence that sums to one +* we replace integration with summation in the formula like {eq}`eq:CDFfromdensity` that relates a CDF to a probability mass function In this lecture, we mostly discuss discrete random variables. @@ -204,9 +193,7 @@ In this lecture, we mostly discuss discrete random variables. Doing this enables us to confine our tool set basically to linear algebra. Later we'll briefly discuss how to approximate a continuous random variable with a discrete random variable. - -+++ ## Univariate Probability Distributions @@ -259,11 +246,12 @@ where $\theta $ is a vector of parameters that is of much smaller dimension than **Remarks:** - - The concept of **parameter** is intimately related to the notion of **sufficient statistic**. - - Sufficient statistic are nonlinear function of a data set. - - Sufficient statistics are designed to summarize all **information** about the parameters that is contained in the big data set. - - They are important tools that AI uses to reduce the size of a **big data** set - - R. A. Fisher provided a sharp definition of **information** -- see + +- The concept of **parameter** is intimately related to the notion of **sufficient statistic**. +- Sufficient statistic are nonlinear function of a data set. 
+- Sufficient statistics are designed to summarize all **information** about the parameters that is contained in the big data set. +- They are important tools that AI uses to reduce the size of a **big data** set +- R. A. Fisher provided a sharp definition of **information** -- see @@ -283,8 +271,6 @@ $$ f_i( \theta)\ge0, \sum_{i=0}^{\infty}f_i(\theta)=1 $$ -+++ - ### Continuous random variable Let $X$ be a continous random variable that takes values $X \in \tilde{X}\equiv[X_U,X_L]$ whose distributions have parameters $\theta$. @@ -299,15 +285,12 @@ $$ \textrm{Prob}\{X\in \tilde{X}\} =1 $$ -+++ - ## Bivariate Probability Distributions We'll now discuss a bivariate **joint distribution**. To begin, we restrict ourselves to two discrete random variables. - Let $X,Y$ be two discrete random variables that take values: $$ @@ -335,7 +318,6 @@ where $$ \sum_{i}\sum_{j}f_{ij}=1 $$ -+++ ## Marginal Probability Distributions @@ -349,8 +331,6 @@ $$ \textrm{Prob}\{Y=j\}= \sum_{i=0}^{I-1}f_{ij} = \nu_i, i=0,\ldots,J-1 $$ - - For example, let the joint distribution over $(X,Y)$ be $$ @@ -362,21 +342,18 @@ F = \left[ \right] $$ (eq:example101discrete) - Then marginal distributions are: - $$ -\begin{aligned} \textrm{Prob} \{X=0\}&=.25+.1=.35\\ +\begin{aligned} +\textrm{Prob} \{X=0\}&=.25+.1=.35\\ \textrm{Prob}\{X=1\}& =.15+.5=.65\\ \textrm{Prob}\{Y=0\}&=.25+.15=.4\\ \textrm{Prob}\{Y=1\}&=.1+.5=.6 \end{aligned} $$ - - -**Digression:** If two random variables $X,Y$ are continuous and have joint density $f(x,y)$, then marginal distributions can be computed by +**Digression:** If two random variables $X,Y$ are continuous and have joint density $f(x,y)$, then marginal distributions can be computed by $$ \begin{aligned} @@ -385,8 +362,6 @@ f(y)& = \int_{\mathbb{R}} f(x,y) dx \end{aligned} $$ -+++ - ## Conditional Probability Distributions Conditional probabilities are defined according to @@ -425,8 +400,6 @@ $$ \textrm{Prob}\{X=0|Y=1\} =\frac{ .1}{.1+.5}=\frac{.1}{.6} $$ -+++ - ## Statistical Independence Random variables X and Y are statistically **independent** if @@ -438,28 +411,24 @@ $$ where $$ -\begin{align} +\begin{aligned} \textrm{Prob}\{X=i\} &=f_i\ge0, \sum{f_i}=1 \cr \textrm{Prob}\{Y=j\} & =g_j\ge0, \sum{g_j}=1 -\end{align} +\end{aligned} $$ Conditional distributions are $$ -\begin{align} +\begin{aligned} \textrm{Prob}\{X=i|Y=j\} & =\frac{f_ig_i}{\sum_{i}f_ig_j}=\frac{f_ig_i}{g_i}=f_i \\ \textrm{Prob}\{Y=j|X=i\} & =\frac{f_ig_i}{\sum_{j}f_ig_j}=\frac{f_ig_i}{f_i}=g_i -\end{align} +\end{aligned} $$ -+++ ## Means and Variances -+++ - - The mean and variance of a discrete random variable $X$ are $$ @@ -480,10 +449,6 @@ $$ $$ - - - - ## Classic Trick for Generating Random Numbers Suppose we have at our disposal a pseudo random number that draws a uniform random variable, i.e., one with probability distribution @@ -526,7 +491,7 @@ Thus, suppose that It turns out that if we use draw uniform random numbers $U$ and then compute $X$ from $$ -X=F^{-1}(U)$, +X=F^{-1}(U), $$ then $X$ ia a random variable with CDF $F_X(x)=F(x)=\textrm{Prob}\{X\le x\}$. 
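The inverse-CDF trick is easiest to see with a distribution whose CDF has a simple closed-form inverse. The sketch below uses an exponential distribution as an illustrative choice (the lecture's own applications, which use the geometric distribution, follow): with $F(x) = 1 - e^{-\lambda x}$ we have $F^{-1}(u) = -\log(1-u)/\lambda$.

```{code-cell} ipython3
import numpy as np

# exponential distribution with rate lam (an illustrative choice, not from the lecture):
# F(x) = 1 - exp(-lam * x)  so  F^{-1}(u) = -log(1 - u) / lam
lam = 0.5
rng = np.random.default_rng(4)

U = rng.uniform(size=200_000)
X = -np.log(1 - U) / lam          # X = F^{-1}(U)

print("sample mean:", X.mean(), " theory:", 1 / lam)
print("sample var: ", X.var(),  " theory:", 1 / lam**2)
```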
@@ -613,19 +578,23 @@ plt.show() Let $X$ distributed geometrically, that is -\begin{align} +$$ +\begin{aligned} \textrm{Prob}(X=i) & =(1-\lambda)\lambda^i,\quad\lambda\in(0,1), \quad i=0,1,\dots \\ & \sum_{i=0}^{\infty}\textrm{Prob}(X=i)=1\longleftrightarrow(1- \lambda)\sum_{i=0}^{\infty}\lambda^i=\frac{1-\lambda}{1-\lambda}=1 -\end{align} +\end{aligned} +$$ Its CDF is given by -\begin{align} +$$ +\begin{aligned} \textrm{Prob}(X\le i)& =(1-\lambda)\sum_{j=0}^{i}\lambda^i\\ & =(1-\lambda)[\frac{1-\lambda^{i+1}}{1-\lambda}]\\ & =1-\lambda^{i+1}\\ & =F(X)=F_i \quad -\end{align} +\end{aligned} +$$ Again, let $\tilde{U}$ follow a uniform distribution and we want to find $X$ such that $F(X)=\tilde{U}$. @@ -688,9 +657,7 @@ plt.hist(x_g, bins=150, density=True, alpha=0.6) plt.show() ``` - - -## Some Discrete Probability Distributions +## Some Discrete Probability Distributions Let's write some Python code to compute means and variances of soem univariate random variables. @@ -703,11 +670,17 @@ We'll use our code to ## Geometric distribution -$$ \textrm{Prob}(X=k)=(1-p)^{k-1}p,k=1,2, \ldots $$ +$$ +\textrm{Prob}(X=k)=(1-p)^{k-1}p,k=1,2, \ldots +$$ + $\implies$ -$$\begin{align} -\mathbb{E}(X) & =\frac{1}{p}\\\mathbb{D}(X) & =\frac{1-p}{p^2} \end{align}$$ +$$ +\begin{aligned} +\mathbb{E}(X) & =\frac{1}{p}\\\mathbb{D}(X) & =\frac{1-p}{p^2} +\end{aligned} +$$ We draw observations from the distribution and compare the sample mean and variance with the theoretical results. @@ -797,7 +770,8 @@ Its distribution is $$ \begin{aligned} X & \sim NB(r,p) \\ -\textrm{Prob}(X=k;r,p) & = \begin{pmatrix}k+r-1 \\ r-1 \end{pmatrix}p^r(1-p)^{k} \end{aligned} +\textrm{Prob}(X=k;r,p) & = \begin{pmatrix}k+r-1 \\ r-1 \end{pmatrix}p^r(1-p)^{k} +\end{aligned} $$ Here, we choose from among $k+r-1$ possible outcomes because the last draw is by definition a success. @@ -928,8 +902,6 @@ $$ f(x) = 0.0005 $$ -+++ - Let's start by generating a random sample and computing sample moments. ```{code-cell} ipython3 @@ -969,19 +941,10 @@ print("mean: ", mean) print("variance: ", var) ``` - - - - ## Matrix Representation of Some Bivariate Distributions -+++ - Let's use matrices to represent a joint distribution, conditional distribution, marginal distribution, and the mean and variance of a bivariate random variable. -+++ - - The table below illustrates a probability distribution for a bivariate random variable. $$ @@ -996,13 +959,9 @@ Marginal distributions are $$ \textrm{Prob}(X=i)=\sum_j{f_{ij}}=u_i $$ $$ \textrm{Prob}(Y=j)=\sum_i{f_{ij}}=v_j $$ - - Below we draw some samples confirm that the "sampling" distribution agrees well with the "population" distribution. -+++ - -#### Sample results +**Sample results:** ```{code-cell} ipython3 # specify parameters @@ -1129,9 +1088,6 @@ $$ These population objects closely resemble sample counterparts computed above. -+++ - - Let's wrap some of the functions we have used in a Python class for a general discrete bivariate joint distribution. 
```{code-cell} ipython3 @@ -1287,9 +1243,6 @@ d_new.marg_dist() d_new.cond_dist() ``` - -+++ - ## A Continuous Bivariate Random Vector @@ -1301,9 +1254,7 @@ $$ $$ -\begin{equation} \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}}\exp\left[-\frac{1}{2(1-\rho^2)}\left(\frac{(x-\mu_1)^2}{\sigma_1^2}-\frac{2\rho(x-\mu_1)(y-\mu_2)}{\sigma_1\sigma_2}+\frac{(y-\mu_2)^2}{\sigma_2^2}\right)\right] -\end{equation} $$ We start with a bivariate normal distribution pinned down by @@ -1407,9 +1358,9 @@ plt.show() The population conditional distribution is $$ -\begin{aligned} -[X|Y & =y ]\sim \mathbb{N}\bigg[\mu_X+\rho\sigma_X\frac{y-\mu_Y}{\sigma_Y},\sigma_X^2(1-\rho^2)\bigg] \\ -[Y|X= &x ]\sim \mathbb{N}\bigg[\mu_Y+\rho\sigma_Y\frac{x-\mu_X}{\sigma_X},\sigma_Y^2(1-\rho^2)\bigg] +\begin{aligned} \\ +[X|Y &= y ]\sim \mathbb{N}\bigg[\mu_X+\rho\sigma_X\frac{y-\mu_Y}{\sigma_Y},\sigma_X^2(1-\rho^2)\bigg] \\ +[Y|X &= x ]\sim \mathbb{N}\bigg[\mu_Y+\rho\sigma_Y\frac{x-\mu_X}{\sigma_X},\sigma_Y^2(1-\rho^2)\bigg] \end{aligned} $$ @@ -1494,7 +1445,8 @@ Define a new random variable $Z=X+Y$. Evidently, $Z$ takes values from $\bar{Z}$ defined as follows: $$ -\begin{aligned} \bar{X} & =\{0,1,\ldots,I-1\};\qquad f_i= \textrm{Prob} \{X=i\}\\ +\begin{aligned} +\bar{X} & =\{0,1,\ldots,I-1\};\qquad f_i= \textrm{Prob} \{X=i\}\\ \bar{Y} & =\{0,1,\ldots,J-1\};\qquad g_j= \textrm{Prob}\{Y=j\}\\ \bar{Z}& =\{0,1,\ldots,I+J-2\};\qquad h_k= \textrm{Prob} \{X+Y=k\} \end{aligned} @@ -1523,11 +1475,7 @@ $$ f_{Z}(z)=\int_{-\infty}^{\infty} f_{X}(x) f_{Y}(z-x) dx \equiv f_{X}*g_{Y} $$ -where $ f_{X}*g_{Y}$ denotes the **convolution** of the $f_X$ and $g_Y$ functions. - - -+++ - +where $ f_{X}*g_{Y} $ denotes the **convolution** of the $f_X$ and $g_Y$ functions. ## Transition Probability Matrix @@ -1538,6 +1486,7 @@ Let $X,Y$ be discrete random variables with joint distribution $$ \textrm{Prob}\{X=i,Y=j\} = \rho_{ij} $$ + where $i = 0,\dots,I-1; j = 0,\dots,J-1$ and $$ @@ -1551,11 +1500,8 @@ $$ = \frac{\textrm{Prob}\{Y=j, X=i\}}{\textrm{Prob}\{ X=i\}} $$ -+++ - We can define a transition probability matrix - $$ p_{ij}=\textrm{Prob}\{Y=j|X=i\}= \frac{\rho_{ij}}{ \sum_{j}\rho_{ij}} $$ @@ -1578,11 +1524,8 @@ The second row is the probability of $Y=j, j=0,1$ conditional on $X=1$. Note that - $\sum_{j}\rho_{ij}= \frac{ \sum_{j}\rho_{ij}}{ \sum_{j}\rho_{ij}}=1$, so each row of $\rho$ is a probability distribution (not so for each column. -+++ - ## Coupling - Start with a joint distribution $$ @@ -1614,8 +1557,6 @@ We'll find that from two marginal distributions, can we usually construct more t Each of these joint distributions is called a **coupling** of the two martingal distributions. -+++ - Let's start with marginal distributions $$ @@ -1641,9 +1582,7 @@ $$ \end{aligned} $$ -+++ - -We construct two couplings. +We construct two couplings. The first coupling if our two marginal distributions is the joint distribution @@ -1668,8 +1607,6 @@ $$ \end{aligned} $$ -+++ - A second coupling of our two marginal distributions is the joint distribution @@ -1704,15 +1641,13 @@ Thus, multiple joint distributions $[f_{ij}]$ can have the same marginals. **Remark:** - Couplings are important in optimal transport problems and in Markov processes. 
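The claim that several joint distributions can share the same marginals is easy to verify numerically. The sketch below uses illustrative marginals (not the lecture's numbers), forms the independence coupling, and then shifts mass in a way that leaves every row sum and column sum unchanged while keeping all entries nonnegative.

```{code-cell} ipython3
import numpy as np

# illustrative marginals for X and Y (assumptions for this sketch)
f = np.array([0.4, 0.6])          # Prob(X = i)
g = np.array([0.3, 0.7])          # Prob(Y = j)

# coupling 1: independence, f_ij = f_i * g_j
c1 = np.outer(f, g)

# coupling 2: move mass along the diagonal; row and column sums are unchanged
# and all entries remain nonnegative
c2 = c1 + np.array([[ 0.1, -0.1],
                    [-0.1,  0.1]])

for c in (c1, c2):
    print(c)
    print("row sums:", c.sum(axis=1), " column sums:", c.sum(axis=0))
```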
-+++ - ## Copula Functions Suppose that $X_1, X_2, \dots, X_n$ are $N$ random variables and that - * their marginal distributions are $F_1(x_1), F_2(x_2),\dots, F_N(x_N)$, and - - * their joint distribution is $H(x_1,x_2,\dots,x_N)$ +* their marginal distributions are $F_1(x_1), F_2(x_2),\dots, F_N(x_N)$, and + +* their joint distribution is $H(x_1,x_2,\dots,x_N)$ Then there exists a **copula function** $C(\cdot)$ that verifies @@ -1734,8 +1669,6 @@ Thus, for given marginal distributions, we can use a copula function to determi Copula functions are often used to characterize **dependence** of random variables. -+++ - **Discrete marginal distribution** TOM -- REWRITE OR MAYBE DROP PARTS OF @@ -1962,8 +1895,6 @@ We have verified that both joint distributions, $c_1$ and $c_2$, have identical So they are both couplings of $X$ and $Y$. -+++ - ## Time Series Suppose that there are two time periods. @@ -1976,10 +1907,10 @@ Let $X(0)$ be a random variable to be realized at $t=0$, $X(1)$ be a random var Suppose that $$ -\begin{align} +\begin{aligned} \textrm{\textrm{Prob}}\{X(0)=i,X(1)=j\} &=f_{ij}≥0,i=0,……,I-1\\ \sum_{i}\sum_{j}f_{ij}&=1 -\end{align} +\end{aligned} $$ $f_{ij} $ is a joint distribution over $[X(0), X(1)]$.
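As a minimal sketch tying this back to the transition-matrix discussion above, the code below starts from an illustrative joint matrix $[f_{ij}]$ over $(X(0), X(1))$ (made-up numbers, not the lecture's), recovers the marginal distribution of $X(0)$, and forms the matrix of conditional probabilities $\textrm{Prob}\{X(1)=j \mid X(0)=i\}$, whose rows each sum to one.

```{code-cell} ipython3
import numpy as np

# an illustrative joint distribution over (X(0), X(1)) -- not the lecture's numbers
f = np.array([[0.25, 0.10],
              [0.15, 0.50]])

marginal_0 = f.sum(axis=1)                 # Prob{X(0) = i}
P = f / marginal_0[:, None]                # Prob{X(1) = j | X(0) = i}

print("marginal of X(0):", marginal_0)
print("conditional (transition) matrix:\n", P)
print("row sums:", P.sum(axis=1))          # each row sums to 1
```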