diff --git a/environment.yml b/environment.yml index b1109c041..dba2f3790 100644 --- a/environment.yml +++ b/environment.yml @@ -14,6 +14,7 @@ dependencies: - ghp-import==1.1.0 - sphinxcontrib-youtube==1.1.0 - sphinx-togglebutton==0.3.1 + - arviz==0.12.1 # Sandpit Requirements - quantecon - array-to-latex diff --git a/lectures/ar1_bayes.md b/lectures/ar1_bayes.md index 53c3d5b58..f49adc3de 100644 --- a/lectures/ar1_bayes.md +++ b/lectures/ar1_bayes.md @@ -11,9 +11,7 @@ kernelspec: name: python3 --- -## Posterior Distributions for AR(1) Parameters - - +# Posterior Distributions for AR(1) Parameters We'll begin with some Python imports. @@ -174,7 +172,7 @@ Now we shall use Bayes' law to construct a posterior distribution, conditioning First we'll use **pymc4**. -### `PyMC` Implementation +## `PyMC` Implementation For a normal distribution in `pymc`, $var = 1/\tau = \sigma^{2}$. @@ -286,7 +284,7 @@ We'll return to this issue after we use `numpyro` to compute posteriors under ou We'll now repeat the calculations using `numpyro`. -### `Numpyro` Implementation +## `Numpyro` Implementation ```{code-cell} ipython3 diff --git a/lectures/ar1_turningpts.md b/lectures/ar1_turningpts.md index c1f0e78b0..34d3eb490 100644 --- a/lectures/ar1_turningpts.md +++ b/lectures/ar1_turningpts.md @@ -13,6 +13,12 @@ kernelspec: # Forecasting an AR(1) process +```{code-cell} ipython3 +:tags: [hide-output] + +!pip install arviz pymc +``` + This lecture describes methods for forecasting statistics that are functions of future values of a univariate autogressive process. The methods are designed to take into account two possible sources of uncertainty about these statistics: @@ -25,7 +31,7 @@ We consider two sorts of statistics: - prospective values $y_{t+j}$ of a random process $\{y_t\}$ that is governed by the AR(1) process -- sample path properties that are defined as non-linear functions of future values $\{y_{t+j}\}_{j\geq 1}$ at time $t$. +- sample path properties that are defined as non-linear functions of future values $\{y_{t+j}\}_{j \geq 1}$ at time $t$. **Sample path properties** are things like "time to next turning point" or "time to next recession" @@ -54,54 +60,48 @@ logger = logging.getLogger('pymc') logger.setLevel(logging.CRITICAL) ``` -## A Univariate First-Order Autoregressive Process - +## A Univariate First-Order Autoregressive Process Consider the univariate AR(1) model: $$ y_{t+1} = \rho y_t + \sigma \epsilon_{t+1}, \quad t \geq 0 -$$ (eq1) +$$ (ar1-tp-eq1) where the scalars $\rho$ and $\sigma$ satisfy $|\rho| < 1$ and $\sigma > 0$; $\{\epsilon_{t+1}\}$ is a sequence of i.i.d. normal random variables with mean $0$ and variance $1$. -The initial condition $y_{0}$ is a known number. +The initial condition $y_{0}$ is a known number. 
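As a minimal illustration of the law of motion {eq}`ar1-tp-eq1`, one can simulate a path directly with NumPy. This is only a sketch with made-up parameter values; the lecture builds its own `AR1_simulate` helper below.

```{code-cell} ipython3
import numpy as np

# illustrative parameter values (assumptions for this sketch, not the lecture's)
rho, sigma, y0, T = 0.9, 1.0, 10.0, 100

rng = np.random.default_rng(0)
y = np.empty(T + 1)
y[0] = y0
for t in range(T):
    # y_{t+1} = rho * y_t + sigma * eps_{t+1},  eps_{t+1} ~ N(0, 1)
    y[t + 1] = rho * y[t] + sigma * rng.standard_normal()

y[:5]
```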
-Equation {eq}`eq1` implies that for $t \geq 0$, the conditional density of $y_{t+1}$ is +Equation {eq}`ar1-tp-eq1` implies that for $t \geq 0$, the conditional density of $y_{t+1}$ is $$ f(y_{t+1} | y_{t}; \rho, \sigma) \sim {\mathcal N}(\rho y_{t}, \sigma^2) \ -$$ (eq2) - - -Further, equation {eq}`eq1` also implies that for $t \geq 0$, the conditional density of $y_{t+j}$ for $j \geq 1$ is +$$ (ar1-tp-eq2) +Further, equation {eq}`ar1-tp-eq1` also implies that for $t \geq 0$, the conditional density of $y_{t+j}$ for $j \geq 1$ is $$ f(y_{t+j} | y_{t}; \rho, \sigma) \sim {\mathcal N}\left(\rho^j y_{t}, \sigma^2 \frac{1 - \rho^{2j}}{1 - \rho^2} \right) -$$ (eq3) +$$ (ar1-tp-eq3) - -The predictive distribution {eq}`eq3` that assumes that the parameters $\rho, \sigma$ are known, which we express +The predictive distribution {eq}`ar1-tp-eq3` that assumes that the parameters $\rho, \sigma$ are known, which we express by conditioning on them. We also want to compute a predictive distribution that does not condition on $\rho,\sigma$ but instead takes account of our uncertainty about them. -We form this predictive distribution by integrating {eq}`eq3` with respect to a joint posterior distribution $\pi_t(\rho,\sigma | y^t )$ +We form this predictive distribution by integrating {eq}`ar1-tp-eq3` with respect to a joint posterior distribution $\pi_t(\rho,\sigma | y^t)$ that conditions on an observed history $y^t = \{y_s\}_{s=0}^t$: $$ f(y_{t+j} | y^t) = \int f(y_{t+j} | y_{t}; \rho, \sigma) \pi_t(\rho,\sigma | y^t ) d \rho d \sigma -$$ (eq4) - - +$$ (ar1-tp-eq4) -Predictive distribution {eq}`eq3` assumes that parameters $(\rho,\sigma)$ are known. +Predictive distribution {eq}`ar1-tp-eq3` assumes that parameters $(\rho,\sigma)$ are known. -Predictive distribution {eq}`eq4` assumes that parameters $(\rho,\sigma)$ are uncertain, but have known probability distribution $\pi_t(\rho,\sigma | y^t )$. +Predictive distribution {eq}`ar1-tp-eq4` assumes that parameters $(\rho,\sigma)$ are uncertain, but have known probability distribution $\pi_t(\rho,\sigma | y^t )$. -We also want to compute some predictive distributions of "sample path statistics" that might include, for example +We also want to compute some predictive distributions of "sample path statistics" that might include, for example - the time until the next "recession", - the minimum value of $Y$ over the next 8 periods, @@ -115,20 +115,16 @@ To accomplish that for situations in which we are uncertain about parameter valu - for each draw $n=0,1,...,N$, simulate a "future path" of length $T_1$ with parameters $\left(\rho_n,\sigma_n\right)$ and compute our three "sample path statistics"; - finally, plot the desired statistics from the $N$ samples as an empirical distribution. - - ## Implementation First, we'll simulate a sample path from which to launch our forecasts. In addition to plotting the sample path, under the assumption that the true parameter values are known, we'll plot $.9$ and $.95$ coverage intervals using conditional distribution -{eq}`eq3` described above. +{eq}`ar1-tp-eq3` described above. We'll also plot a bunch of samples of sequences of future values and watch where they fall relative to the coverage interval. - - ```{code-cell} ipython3 def AR1_simulate(rho, sigma, y0, T): @@ -198,21 +194,21 @@ Wecker {cite}`wecker1979predicting` proposed using simulation techniques to char He called these functions "path properties" to contrast them with properties of single data points. 
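As a quick numerical sanity check of the closed-form $j$-step-ahead conditional distribution {eq}`ar1-tp-eq3` used above, the following sketch simulates many $j$-step-ahead draws from a fixed $y_t$ and compares their sample mean and variance with the formula. The parameter values are illustrative assumptions only.

```{code-cell} ipython3
import numpy as np

# illustrative parameters (assumptions for this sketch only)
rho, sigma, y_t, j = 0.9, 1.0, 2.0, 8
N = 200_000

rng = np.random.default_rng(1)
y = np.full(N, y_t)
for _ in range(j):
    y = rho * y + sigma * rng.standard_normal(N)

# closed-form mean and variance from {eq}`ar1-tp-eq3`
mean_theory = rho**j * y_t
var_theory = sigma**2 * (1 - rho**(2 * j)) / (1 - rho**2)

print(f"simulated mean {y.mean():.3f} vs theory {mean_theory:.3f}")
print(f"simulated var  {y.var():.3f} vs theory {var_theory:.3f}")
```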
-He studied two special prospective path properties of a given series $\{y_t\}$. +He studied two special prospective path properties of a given series $\{y_t\}$. -The first was **time until the next turning point** +The first was **time until the next turning point** - * he defined a **"turning point"** to be the date of the second of two successive declines in $y$. +* he defined a **"turning point"** to be the date of the second of two successive declines in $y$. -To examine this statistic, let $Z$ be an indicator process +To examine this statistic, let $Z$ be an indicator process -$$ + Then the random variable **time until the next turning point** is defined as the following **stopping time** with respect to $Z$: @@ -220,8 +216,8 @@ $$ W_t(\omega):= \inf \{ k\geq 1 \mid Z_{t+k}(\omega) = 1\} $$ -Wecker {cite}`wecker1979predicting` also studied **the minimum value of $Y$ over the next 8 quarters** -which can be defined as the random variable +Wecker {cite}`wecker1979predicting` also studied **the minimum value of $Y$ over the next 8 quarters** +which can be defined as the random variable $$ M_t(\omega) := \min \{ Y_{t+1}(\omega); Y_{t+2}(\omega); \dots; Y_{t+8}(\omega)\} @@ -230,36 +226,34 @@ $$ It is interesting to study yet another possible concept of a **turning point**. Thus, let - + Define a **positive turning point today or tomorrow** statistic as -$$ + This is designed to express the event - - ``after one or two decrease(s), $Y$ will grow for two consecutive quarters'' - +- ``after one or two decrease(s), $Y$ will grow for two consecutive quarters'' Following {cite}`wecker1979predicting`, we can use simulations to calculate probabilities of $P_t$ and $N_t$ for each period $t$. ## A Wecker-Like Algorithm - The procedure consists of the following steps: * index a sample path by $\omega_i$ @@ -270,11 +264,9 @@ $$ Y(\omega_i) = \left\{ Y_{t+1}(\omega_i), Y_{t+2}(\omega_i), \dots, Y_{t+N}(\omega_i)\right\}_{i=1}^I $$ -* for each path $\omega_i$, compute the associated value of $W_t(\omega_i), W_{t+1}(\omega_i), \dots$ +* for each path $\omega_i$, compute the associated value of $W_t(\omega_i), W_{t+1}(\omega_i), \dots$ -* consider the sets $ -\{W_t(\omega_i)\}^{T}_{i=1}, \ \{W_{t+1}(\omega_i)\}^{T}_{i=1}, \ \dots, \ \{W_{t+N}(\omega_i)\}^{T}_{i=1} -$ as samples from the predictive distributions $f(W_{t+1} \mid \mathcal y_t, \dots)$, $f(W_{t+2} \mid y_t, y_{t-1}, \dots)$, $\dots$, $f(W_{t+N} \mid y_t, y_{t-1}, \dots)$. +* consider the sets $\{W_t(\omega_i)\}^{T}_{i=1}, \ \{W_{t+1}(\omega_i)\}^{T}_{i=1}, \ \dots, \ \{W_{t+N}(\omega_i)\}^{T}_{i=1}$ as samples from the predictive distributions $f(W_{t+1} \mid \mathcal y_t, \dots)$, $f(W_{t+2} \mid y_t, y_{t-1}, \dots)$, $\dots$, $f(W_{t+N} \mid y_t, y_{t-1}, \dots)$. ## Using Simulations to Approximate a Posterior Distribution @@ -283,7 +275,6 @@ The next code cells use `pymc` to compute the time $t$ posterior distribution of Note that in defining the likelihood function, we choose to condition on the initial value $y_0$. - ```{code-cell} ipython3 def draw_from_posterior(sample): """ @@ -328,7 +319,6 @@ The graphs on the left portray posterior marginal distributions. ## Calculating Sample Path Statistics - Our next step is to prepare Python codeto compute our sample path statistics. 
```{code-cell} ipython3 @@ -455,9 +445,9 @@ plt.show() ## Extended Wecker Method Now we apply we apply our "extended" Wecker method based on predictive densities of $y$ defined by -{eq}`eq4` that acknowledge posterior uncertainty in the parameters $\rho, \sigma$. +{eq}`ar1-tp-eq4` that acknowledge posterior uncertainty in the parameters $\rho, \sigma$. -To approximate the intergration on the right side of {eq}`eq4`, we repeately draw parameters from the joint posterior distribution each time we simulate a sequence of future values from model {eq}`eq1`. +To approximate the intergration on the right side of {eq}`ar1-tp-eq4`, we repeately draw parameters from the joint posterior distribution each time we simulate a sequence of future values from model {eq}`ar1-tp-eq1`. ```{code-cell} ipython3 def plot_extended_Wecker(post_samples, initial_path, N, ax): @@ -525,4 +515,3 @@ plot_extended_Wecker(post_samples, initial_path, 1000, ax) plt.legend() plt.show() ``` - diff --git a/lectures/bayes_nonconj.md b/lectures/bayes_nonconj.md index f92c1ff07..2d17b6f90 100644 --- a/lectures/bayes_nonconj.md +++ b/lectures/bayes_nonconj.md @@ -888,7 +888,7 @@ SVI_num_steps = 5000 true_theta = 0.8 ``` -#### Beta Prior and Posteriors +### Beta Prior and Posteriors: Let's compare outcomes when we use a Beta prior. @@ -944,7 +944,7 @@ Here the MCMC approximation looks good. But the VI approximation doesn't look so good. - * even though we use the beta distribution as our guide, the VI approximated posterior distributions do not closely resemble the posteriors that we had just computed analytically. +* even though we use the beta distribution as our guide, the VI approximated posterior distributions do not closely resemble the posteriors that we had just computed analytically. (Here, our initial parameter for Beta guide is (0.5, 0.5).) @@ -960,8 +960,6 @@ BayesianInferencePlot(true_theta, num_list, BETA_numpyro).SVI_plot(guide_dist='b ``` - - ## Non-conjugate Prior Distributions Having assured ourselves that our MCMC and VI methods can work well when we have conjugate prior and so can also compute analytically, we diff --git a/lectures/lagrangian_lqdp.md b/lectures/lagrangian_lqdp.md index 1ab5e3442..1ce8eba87 100644 --- a/lectures/lagrangian_lqdp.md +++ b/lectures/lagrangian_lqdp.md @@ -160,7 +160,7 @@ For the undiscounted optimal linear regulator problem, form the Lagrangian $$ {\cal L} = - \sum^\infty_{t=0} \biggl\{ x^\prime_t R x_t + u_t^\prime Q u_t + 2 \mu^\prime_{t+1} [A x_t + B u_t - x_{t+1}]\biggr\} -$$ (eq1) +$$ (lag-lqdp-eq1) where $2 \mu_{t+1}$ is a vector of Lagrange multipliers on the time $t$ transition law $x_{t+1} = A x_t + B u_t$. @@ -172,16 +172,16 @@ $$ \begin{aligned} 2 Q u_t &+ 2B^\prime \mu_{t+1} = 0 \ ,\ t \geq 0 \cr \mu_t &= R x_t + A^\prime \mu_{t+1}\ ,\ t\geq 1.\cr \end{aligned} -$$ (eq2) +$$ (lag-lqdp-eq2) -Define $\mu_0$ to be a vector of shadow prices of $x_0$ and apply an envelope condition to {eq}`eq1` +Define $\mu_0$ to be a vector of shadow prices of $x_0$ and apply an envelope condition to {eq}`lag-lqdp-eq1` to deduce that $$ \mu_0 = R x_0 + A' \mu_1, $$ -which is a time $t=0 $ counterpart to the second equation of system {eq}`eq2`. +which is a time $t=0 $ counterpart to the second equation of system {eq}`lag-lqdp-eq2`. An important fact is that @@ -199,11 +199,11 @@ corresponds to the **state** vector $x_t$. It is useful to proceed with the following steps: -* solve the first equation of {eq}`eq2` for $u_t$ in terms of $\mu_{t+1}$. 
+* solve the first equation of {eq}`lag-lqdp-eq2` for $u_t$ in terms of $\mu_{t+1}$. * substitute the result into the law of motion $x_{t+1} = A x_t + B u_t$. -* arrange the resulting equation and the second equation of {eq}`eq2` into the form +* arrange the resulting equation and the second equation of {eq}`lag-lqdp-eq2` into the form $$ L\ \begin{pmatrix}x_{t+1}\cr \mu_{t+1}\cr\end{pmatrix}\ = \ N\ \begin{pmatrix}x_t\cr \mu_t\cr\end{pmatrix}\ @@ -271,7 +271,7 @@ The rank of $J$ is $2n$. $$ MJM^\prime = J. -$$ (eq3) +$$ (lag-lqdp-eq3) Salient properties of symplectic matrices that are readily verified include: @@ -280,14 +280,14 @@ Salient properties of symplectic matrices that are readily verified include: It can be verified directly that $M$ in equation {eq}`Mdefn` is symplectic. -It follows from equation {eq}`eq3` and from the fact $J^{-1} = J^\prime = -J$ that for any symplectic +It follows from equation {eq}`lag-lqdp-eq3` and from the fact $J^{-1} = J^\prime = -J$ that for any symplectic matrix $M$, $$ M^\prime = J^{-1} M^{-1} J. -$$ (eq4) +$$ (lag-lqdp-eq4) -Equation {eq}`eq4` states that $M^\prime$ is related to the inverse of $M$ +Equation {eq}`lag-lqdp-eq4` states that $M^\prime$ is related to the inverse of $M$ by a **similarity transformation**. For square matrices, recall that @@ -298,7 +298,7 @@ For square matrices, recall that * a matrix and its transpose share eigenvalues -It then follows from equation {eq}`eq4` that +It then follows from equation {eq}`lag-lqdp-eq4` that the eigenvalues of $M$ occur in reciprocal pairs: if $\lambda$ is an eigenvalue of $M$, so is $\lambda^{-1}$. @@ -809,7 +809,7 @@ $$ which is a time $t=0 $ counterpart to the second equation of system {eq}`eq662`. -Proceeding as we did above with the undiscounted system {eq}`eq2`, we can rearrange the first-order conditions into the +Proceeding as we did above with the undiscounted system {eq}`lag-lqdp-eq2`, we can rearrange the first-order conditions into the system $$ @@ -821,7 +821,7 @@ $$ \left[\begin{matrix} x_t \cr \mu_t \end{matrix}\right] $$ (eq663) -which in the special case that $\beta = 1$ agrees with equation {eq}`eq2`, as expected. +which in the special case that $\beta = 1$ agrees with equation {eq}`lag-lqdp-eq2`, as expected. 
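The three facts just used — the defining property {eq}`lag-lqdp-eq3`, the similarity relation {eq}`lag-lqdp-eq4`, and the reciprocal pairing of eigenvalues — are easy to verify numerically. The sketch below uses a generic random symplectic matrix built as the exponential of a Hamiltonian matrix rather than the lecture's particular $M$ from {eq}`Mdefn`, so it is an illustration of the general claims, not of the LQ construction itself.

```{code-cell} ipython3
import numpy as np
from scipy.linalg import expm

n = 3
rng = np.random.default_rng(2)

# J = [[0, I], [-I, 0]]
I_n = np.eye(n)
J = np.block([[np.zeros((n, n)), I_n],
              [-I_n, np.zeros((n, n))]])

# a generic symplectic matrix: M = expm(J @ S) for symmetric S,
# since J @ S is then a Hamiltonian matrix
S = rng.standard_normal((2 * n, 2 * n))
S = 0.5 * (S + S.T)
M = expm(J @ S)

# M J M' = J  (the defining property of a symplectic matrix)
print(np.allclose(M @ J @ M.T, J))

# M' = J^{-1} M^{-1} J  (the similarity transformation)
print(np.allclose(M.T, np.linalg.inv(J) @ np.linalg.inv(M) @ J))

# eigenvalues occur in reciprocal pairs, so the multiset of moduli
# is unchanged by taking reciprocals
eigs = np.linalg.eigvals(M)
print(np.allclose(np.sort(np.abs(eigs)), np.sort(1 / np.abs(eigs))))
```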
+++ diff --git a/lectures/prob_matrix.md b/lectures/prob_matrix.md index f2fdaa784..4a511b657 100644 --- a/lectures/prob_matrix.md +++ b/lectures/prob_matrix.md @@ -19,17 +19,17 @@ After providing somewhat informal definitions of the underlying objects, we'll u Among concepts that we'll be studying include - - a joint probability distribution - - marginal distributions associated with a given joint distribution - - conditional probability distributions - - statistical independence of two random variables - - joint distributions associated with a prescribed set of marginal distributions - - couplings - - copulas - - the probability distribution of a sum of two independent random variables - - convolution of marginal distributions - - parameters that define a probability distribution - - sufficient statistics as data summaries +- a joint probability distribution +- marginal distributions associated with a given joint distribution +- conditional probability distributions +- statistical independence of two random variables +- joint distributions associated with a prescribed set of marginal distributions + - couplings + - copulas +- the probability distribution of a sum of two independent random variables + - convolution of marginal distributions +- parameters that define a probability distribution +- sufficient statistics as data summaries We'll use a matrix to represent a bivariate probability distribution and a vector to represent a univariate probability distribution @@ -57,13 +57,8 @@ We'll briefly define what we mean by a **probability space**, a **probability me For most of this lecture, we sweep these objects into the background, but they are there underlying the other objects that we'll mainly focus on. - - - Let $\Omega$ be a set of possible underlying outcomes and let $\omega \in \Omega$ be a particular underlying outcomes. - - Let $\mathcal{G} \subset \Omega$ be a subset of $\Omega$. Let $\mathcal{F}$ be a collection of such subsets $\mathcal{G} \subset \Omega$. @@ -72,7 +67,7 @@ The pair $\Omega,\mathcal{F}$ forms our **probability space** on which we want A **probability measure** $\mu$ maps a set of possible underlying outcomes $\mathcal{G} \in \mathcal{F}$ into a scalar number between $0$ and $1$ - - this is the "probability" that $X$ belongs to $A$, denoted by $ \textrm{Prob}\{X\in A\}$. +- this is the "probability" that $X$ belongs to $A$, denoted by $ \textrm{Prob}\{X\in A\}$. A **random variable** $X(\omega)$ is a function of the underlying outcome $\omega \in \Omega$. @@ -89,8 +84,6 @@ where ${\mathcal G}$ is the subset of $\Omega$ for which $X(\omega) \in A$. We call this the induced probability distribution of random variable $X$. - - ## Digression: What Does Probability Mean? Before diving in, we'll say a few words about what probability theory means and how it connects to statistics. @@ -103,7 +96,6 @@ These are purely mathematical objects. To appreciate how statisticians connect probabilities to data, the key is to understand the following concepts: - * A single draw from a probability distribution * Repeated independently and identically distributed (i.i.d.) draws of "samples" or "realizations" from the same probability distribution * A **statistic** defined as a function of a sequence of samples @@ -111,7 +103,7 @@ To appreciate how statisticians connect probabilities to data, the key is to und * The idea that a population probability distribution is what we anticipate **relative frequencies** will be in a long sequence of i.i.d. draws. 
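As a concrete preview of the last bullet, here is a minimal simulation (with made-up probabilities) in which the relative frequencies of i.i.d. draws approach the population probabilities as the number of draws grows; the formal statements that make this precise follow.

```{code-cell} ipython3
import numpy as np

# a small "population" distribution over {0, 1, 2} (illustrative numbers)
f = np.array([0.2, 0.5, 0.3])
rng = np.random.default_rng(3)

for N in [100, 10_000, 1_000_000]:
    draws = rng.choice(len(f), size=N, p=f)
    freq = np.bincount(draws, minlength=len(f)) / N
    print(N, freq)   # relative frequencies approach f as N grows
```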
Here the following mathematical machinery makes precise what is meant by **anticipated relative frequencies** - **Law of Large Numbers (LLN)** - **Central Limit Theorem (CLT)** -+++ + **Scalar example** @@ -136,41 +128,38 @@ $$ \end{aligned} $$ - - Consider the **empirical distribution**: -\begin{align} +$$ +\begin{aligned} i & = 0,\dots,I-1,\\ N_i & = \text{number of times} \ X = i,\\ N & = \sum^{I-1}_{i=0} N_i \quad \text{total number of draws},\\ \tilde {f_i} & = \frac{N_i}{N} \sim \ \text{frequency of draws for which}\ X=i -\end{align} +\end{aligned} +$$ Key ideas that justify connecting probability theory with statistics are laws of large numbers and central limit theorems **LLN:** - - A Law of Large Numbers (LLN) states that $\tilde {f_i} \to f_i \text{ as } N \to \infty$ +- A Law of Large Numbers (LLN) states that $\tilde {f_i} \to f_i \text{ as } N \to \infty$ **CLT:** - - A Central Limit Theorem (CLT) describes a **rate** at which $\tilde {f_i} \to f_i$ +- A Central Limit Theorem (CLT) describes a **rate** at which $\tilde {f_i} \to f_i$ **Remarks** - - For "frequentist" statisticians, **anticipated relative frequency** is **all** that a probability distribution means. - - - But for a Bayesian it means something more or different. +- For "frequentist" statisticians, **anticipated relative frequency** is **all** that a probability distribution means. +- But for a Bayesian it means something more or different. ## Representing Probability Distributions - - A probability distribution $\textrm{Prob} (X \in A)$ can be described by its **cumulative distribution function (CDF)** $$ @@ -194,9 +183,9 @@ When a probability density exists, a probability distribution can be characteriz For a **discrete-valued** random variable - * the number of possible values of $X$ is finite or countably infinite - * we replace a **density** with a **probability mass function**, a non-negative sequence that sums to one - * we replace integration with summation in the formula like {eq}`eq:CDFfromdensity` that relates a CDF to a probability mass function +* the number of possible values of $X$ is finite or countably infinite +* we replace a **density** with a **probability mass function**, a non-negative sequence that sums to one +* we replace integration with summation in the formula like {eq}`eq:CDFfromdensity` that relates a CDF to a probability mass function In this lecture, we mostly discuss discrete random variables. @@ -204,9 +193,7 @@ In this lecture, we mostly discuss discrete random variables. Doing this enables us to confine our tool set basically to linear algebra. Later we'll briefly discuss how to approximate a continuous random variable with a discrete random variable. - -+++ ## Univariate Probability Distributions @@ -259,11 +246,12 @@ where $\theta $ is a vector of parameters that is of much smaller dimension than **Remarks:** - - The concept of **parameter** is intimately related to the notion of **sufficient statistic**. - - Sufficient statistic are nonlinear function of a data set. - - Sufficient statistics are designed to summarize all **information** about the parameters that is contained in the big data set. - - They are important tools that AI uses to reduce the size of a **big data** set - - R. A. Fisher provided a sharp definition of **information** -- see + +- The concept of **parameter** is intimately related to the notion of **sufficient statistic**. +- Sufficient statistic are nonlinear function of a data set. 
+- Sufficient statistics are designed to summarize all **information** about the parameters that is contained in the big data set. +- They are important tools that AI uses to reduce the size of a **big data** set +- R. A. Fisher provided a sharp definition of **information** -- see @@ -283,8 +271,6 @@ $$ f_i( \theta)\ge0, \sum_{i=0}^{\infty}f_i(\theta)=1 $$ -+++ - ### Continuous random variable Let $X$ be a continous random variable that takes values $X \in \tilde{X}\equiv[X_U,X_L]$ whose distributions have parameters $\theta$. @@ -299,15 +285,12 @@ $$ \textrm{Prob}\{X\in \tilde{X}\} =1 $$ -+++ - ## Bivariate Probability Distributions We'll now discuss a bivariate **joint distribution**. To begin, we restrict ourselves to two discrete random variables. - Let $X,Y$ be two discrete random variables that take values: $$ @@ -335,7 +318,6 @@ where $$ \sum_{i}\sum_{j}f_{ij}=1 $$ -+++ ## Marginal Probability Distributions @@ -349,8 +331,6 @@ $$ \textrm{Prob}\{Y=j\}= \sum_{i=0}^{I-1}f_{ij} = \nu_i, i=0,\ldots,J-1 $$ - - For example, let the joint distribution over $(X,Y)$ be $$ @@ -362,21 +342,18 @@ F = \left[ \right] $$ (eq:example101discrete) - Then marginal distributions are: - $$ -\begin{aligned} \textrm{Prob} \{X=0\}&=.25+.1=.35\\ +\begin{aligned} +\textrm{Prob} \{X=0\}&=.25+.1=.35\\ \textrm{Prob}\{X=1\}& =.15+.5=.65\\ \textrm{Prob}\{Y=0\}&=.25+.15=.4\\ \textrm{Prob}\{Y=1\}&=.1+.5=.6 \end{aligned} $$ - - -**Digression:** If two random variables $X,Y$ are continuous and have joint density $f(x,y)$, then marginal distributions can be computed by +**Digression:** If two random variables $X,Y$ are continuous and have joint density $f(x,y)$, then marginal distributions can be computed by $$ \begin{aligned} @@ -385,8 +362,6 @@ f(y)& = \int_{\mathbb{R}} f(x,y) dx \end{aligned} $$ -+++ - ## Conditional Probability Distributions Conditional probabilities are defined according to @@ -425,8 +400,6 @@ $$ \textrm{Prob}\{X=0|Y=1\} =\frac{ .1}{.1+.5}=\frac{.1}{.6} $$ -+++ - ## Statistical Independence Random variables X and Y are statistically **independent** if @@ -438,28 +411,24 @@ $$ where $$ -\begin{align} +\begin{aligned} \textrm{Prob}\{X=i\} &=f_i\ge0, \sum{f_i}=1 \cr \textrm{Prob}\{Y=j\} & =g_j\ge0, \sum{g_j}=1 -\end{align} +\end{aligned} $$ Conditional distributions are $$ -\begin{align} +\begin{aligned} \textrm{Prob}\{X=i|Y=j\} & =\frac{f_ig_i}{\sum_{i}f_ig_j}=\frac{f_ig_i}{g_i}=f_i \\ \textrm{Prob}\{Y=j|X=i\} & =\frac{f_ig_i}{\sum_{j}f_ig_j}=\frac{f_ig_i}{f_i}=g_i -\end{align} +\end{aligned} $$ -+++ ## Means and Variances -+++ - - The mean and variance of a discrete random variable $X$ are $$ @@ -480,10 +449,6 @@ $$ $$ - - - - ## Classic Trick for Generating Random Numbers Suppose we have at our disposal a pseudo random number that draws a uniform random variable, i.e., one with probability distribution @@ -526,7 +491,7 @@ Thus, suppose that It turns out that if we use draw uniform random numbers $U$ and then compute $X$ from $$ -X=F^{-1}(U)$, +X=F^{-1}(U), $$ then $X$ ia a random variable with CDF $F_X(x)=F(x)=\textrm{Prob}\{X\le x\}$. 
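The inverse-CDF trick is easiest to see with a distribution whose CDF has a simple closed-form inverse. The sketch below uses an exponential distribution as an illustrative choice (the lecture's own applications, which use the geometric distribution, follow): with $F(x) = 1 - e^{-\lambda x}$ we have $F^{-1}(u) = -\log(1-u)/\lambda$.

```{code-cell} ipython3
import numpy as np

# exponential distribution with rate lam (an illustrative choice, not from the lecture):
# F(x) = 1 - exp(-lam * x)  so  F^{-1}(u) = -log(1 - u) / lam
lam = 0.5
rng = np.random.default_rng(4)

U = rng.uniform(size=200_000)
X = -np.log(1 - U) / lam          # X = F^{-1}(U)

print("sample mean:", X.mean(), " theory:", 1 / lam)
print("sample var: ", X.var(),  " theory:", 1 / lam**2)
```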
@@ -613,19 +578,23 @@ plt.show() Let $X$ distributed geometrically, that is -\begin{align} +$$ +\begin{aligned} \textrm{Prob}(X=i) & =(1-\lambda)\lambda^i,\quad\lambda\in(0,1), \quad i=0,1,\dots \\ & \sum_{i=0}^{\infty}\textrm{Prob}(X=i)=1\longleftrightarrow(1- \lambda)\sum_{i=0}^{\infty}\lambda^i=\frac{1-\lambda}{1-\lambda}=1 -\end{align} +\end{aligned} +$$ Its CDF is given by -\begin{align} +$$ +\begin{aligned} \textrm{Prob}(X\le i)& =(1-\lambda)\sum_{j=0}^{i}\lambda^i\\ & =(1-\lambda)[\frac{1-\lambda^{i+1}}{1-\lambda}]\\ & =1-\lambda^{i+1}\\ & =F(X)=F_i \quad -\end{align} +\end{aligned} +$$ Again, let $\tilde{U}$ follow a uniform distribution and we want to find $X$ such that $F(X)=\tilde{U}$. @@ -688,9 +657,7 @@ plt.hist(x_g, bins=150, density=True, alpha=0.6) plt.show() ``` - - -## Some Discrete Probability Distributions +## Some Discrete Probability Distributions Let's write some Python code to compute means and variances of soem univariate random variables. @@ -703,11 +670,17 @@ We'll use our code to ## Geometric distribution -$$ \textrm{Prob}(X=k)=(1-p)^{k-1}p,k=1,2, \ldots $$ +$$ +\textrm{Prob}(X=k)=(1-p)^{k-1}p,k=1,2, \ldots +$$ + $\implies$ -$$\begin{align} -\mathbb{E}(X) & =\frac{1}{p}\\\mathbb{D}(X) & =\frac{1-p}{p^2} \end{align}$$ +$$ +\begin{aligned} +\mathbb{E}(X) & =\frac{1}{p}\\\mathbb{D}(X) & =\frac{1-p}{p^2} +\end{aligned} +$$ We draw observations from the distribution and compare the sample mean and variance with the theoretical results. @@ -797,7 +770,8 @@ Its distribution is $$ \begin{aligned} X & \sim NB(r,p) \\ -\textrm{Prob}(X=k;r,p) & = \begin{pmatrix}k+r-1 \\ r-1 \end{pmatrix}p^r(1-p)^{k} \end{aligned} +\textrm{Prob}(X=k;r,p) & = \begin{pmatrix}k+r-1 \\ r-1 \end{pmatrix}p^r(1-p)^{k} +\end{aligned} $$ Here, we choose from among $k+r-1$ possible outcomes because the last draw is by definition a success. @@ -928,8 +902,6 @@ $$ f(x) = 0.0005 $$ -+++ - Let's start by generating a random sample and computing sample moments. ```{code-cell} ipython3 @@ -969,19 +941,10 @@ print("mean: ", mean) print("variance: ", var) ``` - - - - ## Matrix Representation of Some Bivariate Distributions -+++ - Let's use matrices to represent a joint distribution, conditional distribution, marginal distribution, and the mean and variance of a bivariate random variable. -+++ - - The table below illustrates a probability distribution for a bivariate random variable. $$ @@ -996,13 +959,9 @@ Marginal distributions are $$ \textrm{Prob}(X=i)=\sum_j{f_{ij}}=u_i $$ $$ \textrm{Prob}(Y=j)=\sum_i{f_{ij}}=v_j $$ - - Below we draw some samples confirm that the "sampling" distribution agrees well with the "population" distribution. -+++ - -#### Sample results +**Sample results:** ```{code-cell} ipython3 # specify parameters @@ -1129,9 +1088,6 @@ $$ These population objects closely resemble sample counterparts computed above. -+++ - - Let's wrap some of the functions we have used in a Python class for a general discrete bivariate joint distribution. 
```{code-cell} ipython3 @@ -1287,9 +1243,6 @@ d_new.marg_dist() d_new.cond_dist() ``` - -+++ - ## A Continuous Bivariate Random Vector @@ -1301,9 +1254,7 @@ $$ $$ -\begin{equation} \frac{1}{2\pi\sigma_1\sigma_2\sqrt{1-\rho^2}}\exp\left[-\frac{1}{2(1-\rho^2)}\left(\frac{(x-\mu_1)^2}{\sigma_1^2}-\frac{2\rho(x-\mu_1)(y-\mu_2)}{\sigma_1\sigma_2}+\frac{(y-\mu_2)^2}{\sigma_2^2}\right)\right] -\end{equation} $$ We start with a bivariate normal distribution pinned down by @@ -1407,9 +1358,9 @@ plt.show() The population conditional distribution is $$ -\begin{aligned} -[X|Y & =y ]\sim \mathbb{N}\bigg[\mu_X+\rho\sigma_X\frac{y-\mu_Y}{\sigma_Y},\sigma_X^2(1-\rho^2)\bigg] \\ -[Y|X= &x ]\sim \mathbb{N}\bigg[\mu_Y+\rho\sigma_Y\frac{x-\mu_X}{\sigma_X},\sigma_Y^2(1-\rho^2)\bigg] +\begin{aligned} \\ +[X|Y &= y ]\sim \mathbb{N}\bigg[\mu_X+\rho\sigma_X\frac{y-\mu_Y}{\sigma_Y},\sigma_X^2(1-\rho^2)\bigg] \\ +[Y|X &= x ]\sim \mathbb{N}\bigg[\mu_Y+\rho\sigma_Y\frac{x-\mu_X}{\sigma_X},\sigma_Y^2(1-\rho^2)\bigg] \end{aligned} $$ @@ -1494,7 +1445,8 @@ Define a new random variable $Z=X+Y$. Evidently, $Z$ takes values from $\bar{Z}$ defined as follows: $$ -\begin{aligned} \bar{X} & =\{0,1,\ldots,I-1\};\qquad f_i= \textrm{Prob} \{X=i\}\\ +\begin{aligned} +\bar{X} & =\{0,1,\ldots,I-1\};\qquad f_i= \textrm{Prob} \{X=i\}\\ \bar{Y} & =\{0,1,\ldots,J-1\};\qquad g_j= \textrm{Prob}\{Y=j\}\\ \bar{Z}& =\{0,1,\ldots,I+J-2\};\qquad h_k= \textrm{Prob} \{X+Y=k\} \end{aligned} @@ -1523,11 +1475,7 @@ $$ f_{Z}(z)=\int_{-\infty}^{\infty} f_{X}(x) f_{Y}(z-x) dx \equiv f_{X}*g_{Y} $$ -where $ f_{X}*g_{Y}$ denotes the **convolution** of the $f_X$ and $g_Y$ functions. - - -+++ - +where $ f_{X}*g_{Y} $ denotes the **convolution** of the $f_X$ and $g_Y$ functions. ## Transition Probability Matrix @@ -1538,6 +1486,7 @@ Let $X,Y$ be discrete random variables with joint distribution $$ \textrm{Prob}\{X=i,Y=j\} = \rho_{ij} $$ + where $i = 0,\dots,I-1; j = 0,\dots,J-1$ and $$ @@ -1551,11 +1500,8 @@ $$ = \frac{\textrm{Prob}\{Y=j, X=i\}}{\textrm{Prob}\{ X=i\}} $$ -+++ - We can define a transition probability matrix - $$ p_{ij}=\textrm{Prob}\{Y=j|X=i\}= \frac{\rho_{ij}}{ \sum_{j}\rho_{ij}} $$ @@ -1578,11 +1524,8 @@ The second row is the probability of $Y=j, j=0,1$ conditional on $X=1$. Note that - $\sum_{j}\rho_{ij}= \frac{ \sum_{j}\rho_{ij}}{ \sum_{j}\rho_{ij}}=1$, so each row of $\rho$ is a probability distribution (not so for each column. -+++ - ## Coupling - Start with a joint distribution $$ @@ -1614,8 +1557,6 @@ We'll find that from two marginal distributions, can we usually construct more t Each of these joint distributions is called a **coupling** of the two martingal distributions. -+++ - Let's start with marginal distributions $$ @@ -1641,9 +1582,7 @@ $$ \end{aligned} $$ -+++ - -We construct two couplings. +We construct two couplings. The first coupling if our two marginal distributions is the joint distribution @@ -1668,8 +1607,6 @@ $$ \end{aligned} $$ -+++ - A second coupling of our two marginal distributions is the joint distribution @@ -1704,15 +1641,13 @@ Thus, multiple joint distributions $[f_{ij}]$ can have the same marginals. **Remark:** - Couplings are important in optimal transport problems and in Markov processes. 
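The claim that several joint distributions can share the same marginals is easy to verify numerically. The sketch below uses illustrative marginals (not the lecture's numbers), forms the independence coupling, and then shifts mass in a way that leaves every row sum and column sum unchanged while keeping all entries nonnegative.

```{code-cell} ipython3
import numpy as np

# illustrative marginals for X and Y (assumptions for this sketch)
f = np.array([0.4, 0.6])          # Prob(X = i)
g = np.array([0.3, 0.7])          # Prob(Y = j)

# coupling 1: independence, f_ij = f_i * g_j
c1 = np.outer(f, g)

# coupling 2: move mass along the diagonal; row and column sums are unchanged
# and all entries remain nonnegative
c2 = c1 + np.array([[ 0.1, -0.1],
                    [-0.1,  0.1]])

for c in (c1, c2):
    print(c)
    print("row sums:", c.sum(axis=1), " column sums:", c.sum(axis=0))
```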
-+++ - ## Copula Functions Suppose that $X_1, X_2, \dots, X_n$ are $N$ random variables and that - * their marginal distributions are $F_1(x_1), F_2(x_2),\dots, F_N(x_N)$, and - - * their joint distribution is $H(x_1,x_2,\dots,x_N)$ +* their marginal distributions are $F_1(x_1), F_2(x_2),\dots, F_N(x_N)$, and + +* their joint distribution is $H(x_1,x_2,\dots,x_N)$ Then there exists a **copula function** $C(\cdot)$ that verifies @@ -1734,8 +1669,6 @@ Thus, for given marginal distributions, we can use a copula function to determi Copula functions are often used to characterize **dependence** of random variables. -+++ - **Discrete marginal distribution** TOM -- REWRITE OR MAYBE DROP PARTS OF @@ -1962,8 +1895,6 @@ We have verified that both joint distributions, $c_1$ and $c_2$, have identical So they are both couplings of $X$ and $Y$. -+++ - ## Time Series Suppose that there are two time periods. @@ -1976,10 +1907,10 @@ Let $X(0)$ be a random variable to be realized at $t=0$, $X(1)$ be a random var Suppose that $$ -\begin{align} +\begin{aligned} \textrm{\textrm{Prob}}\{X(0)=i,X(1)=j\} &=f_{ij}≥0,i=0,……,I-1\\ \sum_{i}\sum_{j}f_{ij}&=1 -\end{align} +\end{aligned} $$ $f_{ij} $ is a joint distribution over $[X(0), X(1)]$.
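As a minimal sketch tying this back to the transition-matrix discussion above, the code below starts from an illustrative joint matrix $[f_{ij}]$ over $(X(0), X(1))$ (made-up numbers, not the lecture's), recovers the marginal distribution of $X(0)$, and forms the matrix of conditional probabilities $\textrm{Prob}\{X(1)=j \mid X(0)=i\}$, whose rows each sum to one.

```{code-cell} ipython3
import numpy as np

# an illustrative joint distribution over (X(0), X(1)) -- not the lecture's numbers
f = np.array([[0.25, 0.10],
              [0.15, 0.50]])

marginal_0 = f.sum(axis=1)                 # Prob{X(0) = i}
P = f / marginal_0[:, None]                # Prob{X(1) = j | X(0) = i}

print("marginal of X(0):", marginal_0)
print("conditional (transition) matrix:\n", P)
print("row sums:", P.sum(axis=1))          # each row sums to 1
```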