# Basic Asymptotics

<!-- Qingliang covers basic asymptotic theory -->




## Modes of Convergence

Let $x_{1},x_{2},\ldots$ be an infinite
sequence of non-random numbers.
*Convergence* of this non-random sequence to a point $x$ means that for any
$\varepsilon>0$, there exists an $N\left(\varepsilon\right)$ such that
for all $n>N\left(\varepsilon\right)$, we have
$\left|x_{n}-x\right|<\varepsilon$. We say $x$ is the limit of $x_{n}$,
and write $x_{n}\to x$ or $\lim_{n\to\infty}x_{n}=x$.



Instead of a deterministic sequence, in this course we are interested in the
convergence of a sequence of random variables. A random variable is a
measurable function, and its randomness comes from the probability measure
that it induces, so we must be clear about what *convergence* means in this
context. Several modes of convergence are widely used.



We say a sequence of random variables $\left(x_{n}\right)$ converges in
probability to $x$, where $x$ can be either a random variable or a
non-random constant, if for any $\varepsilon>0$, the probability

$$
P\left\{ \left|x_{n} -x\right|<\varepsilon\right\} \to1
$$

(or equivalently
$P\left\{ \left|x_{n} -x\right|\geq\varepsilon\right\} \to0$)
as $n\to\infty$. We can write $x_{n}\stackrel{p}{\to}x$ or
$\mathrm{plim}_{n\to\infty}x_{n}=x$.



A sequence of random variables $\left(x_{n}\right)$ *converges in
squared-mean* to $x$, where $x$ can be either a random variable or a
non-random constant, if $E\left[\left(x_{n}-x\right)^{2}\right]\to0.$ It
is denoted as $x_{n}\stackrel{m.s.}{\to}x$.

In both definitions, the quantity
$P\left\{ \left|x_{n}-x\right|\geq\varepsilon\right\}$
or $E\left[\left(x_{n}-x\right)^{2}\right]$ is non-random,
so its convergence to 0 is the convergence of a non-random sequence.

Squared-mean convergence is stronger than convergence in probability.
That is, $x_{n}\stackrel{m.s.}{\to}x$ implies $x_{n}\stackrel{p}{\to}x$
but the converse is untrue. Here is an example.


**Example**

$(x_{n})$ is a sequence
of binary random variables: $x_{n}=\sqrt{n}$ with probability $1/n$, and
$x_{n}=0$ with probability $1-1/n$. Then $x_{n}\stackrel{p}{\to}0$ but
$x_{n}\stackrel{m.s.}{\nrightarrow}0$. To verify these claims, notice
that for any $\varepsilon>0$ and all $n\geq\varepsilon^{2}$ (so that
$\sqrt{n}\geq\varepsilon$), we have
$P\left(\left|x_{n}-0\right|<\varepsilon\right)=P\left(x_{n}=0\right)=1-1/n\rightarrow1$
and thereby $x_{n}\stackrel{p}{\to}0$. On the other hand,
$E\left[\left(x_{n}-0\right)^{2}\right]=n\cdot1/n+0\cdot(1-1/n)=1\nrightarrow0,$
so $x_{n}\stackrel{m.s.}{\nrightarrow}0$.



This example highlights the difference between the two modes of convergence.
Convergence in probability ignores what happens on a subset of
the sample space with vanishing probability. Squared-mean convergence
averages over the entire probability space. If a random variable
can take a wild value, even with small probability, that value may destroy
squared-mean convergence. In contrast, such irregularity does
not destroy convergence in probability.
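
A quick simulation can make the contrast concrete. The sketch below draws the $x_{n}$ of the example above for a few values of $n$ and estimates both the tail probability and the second moment. The helper name `draw_xn`, the tolerance `eps = 0.5`, and the number of draws `R` are illustrative choices, not part of the original notes.

```r
# Sketch: x_n = sqrt(n) with probability 1/n, and 0 otherwise.
# draw_xn, eps and R are hypothetical names chosen for this illustration.
draw_xn = function(n, R) {
  sqrt(n) * rbinom(R, size = 1, prob = 1 / n)
}

set.seed(1)
eps = 0.5
R = 100000
for (n in c(10, 100, 1000, 10000)) {
  x = draw_xn(n, R)
  cat("n =", n,
      " P(|x_n| >= eps) ~", mean(abs(x) >= eps),  # should shrink like 1/n
      " E[x_n^2] ~", mean(x^2), "\n")             # should stay near 1
}
```

The estimated tail probability should fall as $n$ grows, while the estimated second moment should stay around 1, with visible noise for large $n$ because very few draws are nonzero.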



Both convergence in probability and squared-mean convergence are about
convergence of random variables to a target random variable or constant.
That is, the distribution of $(x_{n}-x)$ concentrates around 0 as
$n\to\infty$. In contrast, *convergence in distribution* is about the
convergence of the CDF, not of the random variable itself. Let
$F_{x_{n}}\left(\cdot\right)$ be the CDF of $x_{n}$ and
$F_{x}\left(\cdot\right)$ be the CDF of $x$.



We say a sequence of random variables $\left(x_{n}\right)$ converges in
distribution to a random variable $x$ if
$F_{x_{n}}\left(a\right)\to F_{x}\left(a\right)$ as $n\to\infty$ at each
point $a\in\mathbb{R}$ where $F_{x}\left(\cdot\right)$ is
continuous. We write $x_{n}\stackrel{d}{\to}x$.



Convergence in distribution is the weakest of the three modes. If
$x_{n}\stackrel{p}{\to}x$, then $x_{n}\stackrel{d}{\to}x$. The converse
is not true in general, unless $x$ is a non-random constant. (A constant
$x$ can be viewed as a degenerate random variable, with a corresponding
“CDF” $F_{x}\left(\cdot\right)=1\left\{ \cdot\geq x\right\}$.)




**Example**

Let $x\sim N\left(0,1\right)$. If $x_{n}=x+1/n$, then
$x_{n}\stackrel{p}{\to}x$ and of course $x_{n}\stackrel{d}{\to}x$.
However, if $x_{n}=-x+1/n$, or $x_{n}=y+1/n$ where
$y\sim N\left(0,1\right)$ is independent of $x$, then
$x_{n}\stackrel{d}{\to}x$ but $x_{n}\stackrel{p}{\nrightarrow}x$.
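
The following sketch checks the two claims by simulation for one fixed, large $n$. It compares $x+1/n$ with $-x+1/n$ on the same draws of $x$. The variable names and the choices `n = 1000` and `eps = 0.5` are assumptions made for this illustration.

```r
set.seed(1)
R = 100000   # number of simulated draws of x (assumed)
n = 1000     # a fixed, large index (assumed)
eps = 0.5    # tolerance (assumed)
x = rnorm(R)
xn_same = x + 1 / n    # converges in probability to x
xn_flip = -x + 1 / n   # same limiting distribution, but far from x

# Distribution: both empirical CDFs at 0 should be close to pnorm(0) = 0.5
print(c(same = mean(xn_same <= 0), flip = mean(xn_flip <= 0)))

# Closeness to x: share of draws with |x_n - x| >= eps
print(c(same = mean(abs(xn_same - x) >= eps),
        flip = mean(abs(xn_flip - x) >= eps)))
```

For `xn_flip` the distance $\left|x_{n}-x\right|=\left|-2x+1/n\right|$ does not shrink as $n$ grows, which is why convergence in probability fails even though the distributions agree in the limit.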

**Example**

$(x_{n})$ is a sequence of binary random variables: $x_{n}=n$ with
probability $1/\sqrt{n}$, and $x_{n}=0$ with probability $1-1/\sqrt{n}$.
Then $x_{n}\stackrel{d}{\to}x=0$, because the CDF of $x_{n}$ is

$$
F_{x_{n}}\left(a\right)=\begin{cases}
0 & a<0\\
1-1/\sqrt{n} & 0\leq a<n\\
1 & a\geq n
\end{cases}
$$

and the CDF of the limit is

$$
F_{x}\left(a\right)=\begin{cases}
0 & a<0\\
1 & a\geq0
\end{cases}.
$$

For each fixed $a>0$ we eventually have $n>a$, so
$F_{x_{n}}\left(a\right)=1-1/\sqrt{n}\to1$; for each $a<0$ both CDFs equal 0.
Hence $F_{x_{n}}\left(a\right)$ converges to
$F_{x}\left(a\right)$ *pointwise* at each point in
$\left(-\infty,0\right)\cup\left(0,+\infty\right)$, where
$F_{x}\left(a\right)$ is continuous.
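
The pointwise convergence can also be checked numerically. The sketch below codes the CDF of $x_{n}$ for the binary example with $x_{n}=n$ with probability $1/\sqrt{n}$, and evaluates it on a small grid of points. The function name `F_xn` and the grid `a_grid` are hypothetical names introduced here.

```r
# CDF of x_n, where x_n = n with probability 1/sqrt(n) and x_n = 0 otherwise
F_xn = function(a, n) {
  ifelse(a < 0, 0, ifelse(a < n, 1 - 1 / sqrt(n), 1))
}

a_grid = c(-1, 0.5, 3)   # evaluation points away from the jump of F_x at 0
for (n in c(4, 100, 10000, 1e6)) {
  cat("n =", n, " F_xn(a_grid) =", F_xn(a_grid, n), "\n")
}
```

At the negative point the value stays at 0, and at the positive points it approaches 1 at the rate $1/\sqrt{n}$.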

So far we have talked about convergence of scalar variables. These three
modes of convergence can be easily generalized to random vectors. In
particular, the *Cramér-Wold device* collapses a random vector into a
scalar random variable via an arbitrary linear combination.
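
For reference, the device can be stated as follows. The dimension $K$ and the weight vector $\lambda$ are notation introduced only for this statement. For random vectors $x_{n}$ and $x$ in $\mathbb{R}^{K}$,

$$
x_{n}\stackrel{d}{\to}x\quad\text{if and only if}\quad\lambda'x_{n}\stackrel{d}{\to}\lambda'x\ \text{ for every non-random }\lambda\in\mathbb{R}^{K}.
$$

Each $\lambda'x_{n}$ is a scalar, so the scalar definition of convergence in distribution applies to it directly.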



## Law of Large Numbers

The (weak) law of large numbers (LLN) is a collection of statements about
convergence in probability of the sample average to its population
counterpart. The basic form of the LLN is:

$$
\frac{1}{n}\sum_{i=1}^{n}(x_{i}-E[x_{i}])\stackrel{p}{\to}0
$$ 

as $n\to\infty$. Various versions of LLN work under different assumptions
about some features and/or dependence of the underlying random
variables.



### Chebyshev LLN

We illustrate the LLN with the simple example of the Chebyshev LLN, which can be
proved by elementary calculation. The proof uses the *Chebyshev
inequality*.

-   *Chebyshev inequality*: If a random variable $x$ has a finite second
    moment $E\left[x^{2}\right]<\infty$, then we have
    $P\left\{ \left|x\right|>\varepsilon\right\} \leq E\left[x^{2}\right]/\varepsilon^{2}$
    for any constant $\varepsilon>0$.

**Exercise**

Show that if $r_{2}\geq r_{1}\geq1$, then
$E\left[\left|x\right|^{r_{2}}\right]<\infty$ implies
$E\left[\left|x\right|^{r_{1}}\right]<\infty.$ (Hint: use Hölder’s
inequality.)

The Chebyshev inequality is a special case of the *Markov inequality*.

-   *Markov inequality*: If a random variable $x$ has a finite $r$-th
    absolute moment $E\left[\left|x\right|^{r}\right]<\infty$ for some
    $r\ge1$, then we have
    $P\left\{ \left|x\right|>\varepsilon\right\} \leq E\left[\left|x\right|^{r}\right]/\varepsilon^{r}$
    for any constant $\varepsilon>0$.

It is easy to verify the Markov inequality.

$$
\begin{aligned}E\left[\left|x\right|^{r}\right] & =\int_{\left|x\right|>\varepsilon}\left|x\right|^{r}dF_{X}+\int_{\left|x\right|\leq\varepsilon}\left|x\right|^{r}dF_{X}\\
 & \geq\int_{\left|x\right|>\varepsilon}\left|x\right|^{r}dF_{X}\\
 & \geq\varepsilon^{r}\int_{\left|x\right|>\varepsilon}dF_{X}=\varepsilon^{r}P\left\{ \left|x\right|>\varepsilon\right\} .
\end{aligned}
$$ 

Rearrange the above inequality and we obtain the Markov
inequality.
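
As a numerical sanity check, the sketch below compares the two sides of the Markov inequality on simulated data for $r=1,2,3$. The choice of a $t(5)$ distribution (finite moments of order below 5) and of `eps = 2` is an assumption made for illustration. Because the inequality also holds for the empirical distribution of any sample, the simulated frequency can never exceed the simulated bound.

```r
set.seed(1)
x = rt(100000, df = 5)   # assumed example distribution: t(5)
eps = 2                  # assumed threshold
for (r in 1:3) {
  lhs = mean(abs(x) > eps)       # empirical P(|x| > eps)
  rhs = mean(abs(x)^r) / eps^r   # empirical E|x|^r / eps^r
  cat("r =", r, " P(|x| > eps) ~", lhs, " bound ~", rhs, "\n")
}
```

The case $r=2$ corresponds to the Chebyshev inequality.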



Let the *partial sum* be $S_{n}=\sum_{i=1}^{n}x_{i}$, where
$\mu_{i}=E\left[x_{i}\right]$ and
$\sigma_{i}^{2}=\mathrm{var}\left[x_{i}\right]$. We apply the Chebyshev
inequality to the demeaned sample mean
$z_{n}=\overline{x}-\bar{\mu}=n^{-1}\left(S_{n}-E\left[S_{n}\right]\right)$. Assume the data are iid. Then

$$
\begin{aligned}
P\left\{ \left|z_{n}\right|\geq\varepsilon\right\}  & =P\left\{ n^{-1}\left|S_{n}-E\left[S_{n}\right]\right|\geq\varepsilon\right\} \\
 & \leq E\left[\left(n^{-1}\sum_{i=1}^{n}\left(x_{i}-\mu_{i}\right)\right)^{2}\right]/\varepsilon^{2} \\
 & =\left(n\varepsilon\right)^{-2}  E\left[\sum_{i=1}^{n}\left(x_{i}-\mu_{i}\right)^{2}\right] \\
 & = \frac{1} {n\varepsilon^{2}} \mathrm{var}\left(x_{1}\right),
\end{aligned}
$$

where the second equality holds because the cross terms
$E\left[\left(x_{i}-\mu_{i}\right)\left(x_{j}-\mu_{j}\right)\right]$, $i\neq j$,
vanish under independence. The right-hand side converges to 0 as $n\to\infty$.
 


This result gives the Chebyshev LLN:

-   Chebyshev LLN: If $\left(x_{1},\ldots,x_{n}\right)$ is a sample of
    iid observations, $E\left[x_{1}\right]=\mu$, and
    $\sigma^{2}=\mathrm{var}\left[x_{1}\right]<\infty$, then
    $\frac{1}{n}\sum_{i=1}^{n}x_{i}\stackrel{p}{\to}\mu.$
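
The bound behind the Chebyshev LLN can be compared with simulated frequencies. The sketch below uses iid $N\left(0,1\right)$ data, so $\mu=0$ and $\sigma^{2}=1$, and reports the simulated probability that the sample mean deviates from $\mu$ by at least `eps`, next to the bound $\sigma^{2}/\left(n\varepsilon^{2}\right)$. The function name `cheb_check`, the tolerance, and the number of replications are assumptions for this illustration.

```r
set.seed(1)
cheb_check = function(n, eps, R = 2000) {
  xbar = replicate(R, mean(rnorm(n)))    # iid N(0, 1): mu = 0, sigma^2 = 1
  c(simulated = mean(abs(xbar) >= eps),  # simulated P(|xbar - mu| >= eps)
    bound = min(1, 1 / (n * eps^2)))     # Chebyshev bound, capped at 1
}

for (n in c(100, 400, 1600)) {
  out = cheb_check(n, eps = 0.1)
  cat("n =", n, " simulated ~", out["simulated"],
      " bound =", out["bound"], "\n")
}
```

The bound is typically loose, but it is enough for the LLN because it shrinks to 0 at the rate $1/n$.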


Another useful LLN is the *Kolmogorov LLN*. Since its derivation
requires more advanced knowledge of probability theory, we state the
result without proof.

-   Kolmogorov LLN: If $\left(x_{1},\ldots,x_{n}\right)$ is a sample of
    iid observations and $E\left[x_{1}\right]=\mu$ exists, then
    $\frac{1}{n}\sum_{i=1}^{n}x_{i}\stackrel{p}{\to}\mu$.

Compared with the Chebyshev LLN, the Kolmogorov LLN only requires the
existence of the population mean, but not any higher moments. On the
other hand, iid is essential for the Kolmogorov LLN.



Consider three distributions: standard normal $N\left(0,1\right)$,
$t\left(2\right)$ (zero mean, infinite variance), and the Cauchy
distribution (no moments exist). We plot paths of the sample average
with $n=2^{1},2^{2},\ldots,2^{20}$. We will see that the sample averages
of $N\left(0,1\right)$ and $t\left(2\right)$ converge, but that of the
Cauchy distribution does not.

### Law of Large Numbers

This script demonstrates the law of large numbers (LLN) along with the underlying assumptions.

Write a function to generate the sample mean given the sample size $n$ and the distribution.
We allow three distributions, namely, $N(0,1)$, $t(2)$ and Cauchy.

```r
sample.mean = function(n, distribution) {
  # draw n observations from the named distribution and return the sample mean
  if (distribution == "normal") {
    y = rnorm(n)
  } else if (distribution == "t2") {
    y = rt(n, 2)
  } else if (distribution == "cauchy") {
    y = rcauchy(n)
  } else {
    stop("unknown distribution: ", distribution)
  }
  return(mean(y))
}
```

This function plots the sample mean over the path of geometrically increasing sample size.

```r
LLN.plot = function(distribution) {
  # plot three independent paths of the sample mean along the sample sizes
  # in NN (a global vector defined below) and return them as a matrix
  ybar = matrix(0, length(NN), 3)
  for (rr in 1:3) {
    for (ii in 1:length(NN)) {
      n = NN[ii]
      ybar[ii, rr] = sample.mean(n, distribution)
    }
  }
  matplot(ybar, type = "l", ylab = "mean", xlab = "log2(n)",
          lwd = 1, lty = 1, main = distribution)
  abline(h = 0, lty = 2)
  return(ybar)
}

# calculation
NN = 2^(1:20)
set.seed(2020 - 10 - 7)  # the arithmetic evaluates to seed 2003
par(mfrow = c(3, 1))
l1 = LLN.plot("normal")
l2 = LLN.plot("t2")
l3 = LLN.plot("cauchy")
```


## Central Limit Theorem

The central limit theorem (CLT) is a collection of probability results
about the convergence in distribution to a stable distribution. The
limiting distribution is usually the Gaussian distribution. The basic
form of the CLT is:

-   *Under some conditions to be spelled out*, the sample average of
    *zero-mean* random variables $\left(x_{1},\ldots,x_{n}\right)$
    multiplied by $\sqrt{n}$ satisfies

    $$\frac{1}{\sqrt{n}}\sum_{i=1}^{n}x_{i}\stackrel{d}{\to}N\left(0,\sigma^{2}\right)$$
    
    as $n\to\infty$.

Various versions of CLT work under different assumptions about the
random variables. *Lindeberg-Levy CLT* is the simplest CLT.

-   If the sample $\left(x_{1},\ldots,x_{n}\right)$ is iid,
    $E\left[x_{1}\right]=0$ and
    $\mathrm{var}\left[x_{1}\right]=\sigma^{2}<\infty$, then
    $\frac{1}{\sqrt{n}}\sum_{i=1}^{n}x_{i}\stackrel{d}{\to}N\left(0,\sigma^{2}\right)$.


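**Example**:
In the LLN simulation above, the sample size is chosen as $n=2^{k}$ for $k=1,\ldots,20$. We have the following observations.
* When the distribution is $N(0,1)$, the Chebyshev LLN applies. The sample mean converges fast.
* When the distribution is $t(2)$, which has zero mean but infinite variance, the Chebyshev LLN does not apply but the Kolmogorov LLN does. The sample mean still converges, though more slowly than in the $N(0,1)$ case.
* The Cauchy distribution has no finite moment of any order. The sample mean does not converge no matter how large the sample size is.

The following is a simulated example of the CLT. We draw samples of size $n=2,10,100$ from the $\chi^{2}(2)$ distribution, standardize the sample mean, and compare its histogram over 10000 replications with the standard normal density.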


```r
Z_fun = function(n, distribution) {
  # standardized sample mean: sqrt(n) * (mean - mu) / sigma
  if (distribution == "normal") {
    z = sqrt(n) * mean(rnorm(n))
  } else if (distribution == "chisq2") {
    df = 2  # chisq(df) has mean df and variance 2 * df
    x = rchisq(n, df)
    z = sqrt(n) * (mean(x) - df) / sqrt(2 * df)
  } else {
    stop("unknown distribution: ", distribution)
  }
  return(z)
}

CLT_plot = function(n, distribution) {
  # histogram of the standardized sample mean over Rep replications,
  # overlaid with the N(0, 1) density
  Rep = 10000
  ZZ = rep(0, Rep)
  for (i in 1:Rep) {
    ZZ[i] = Z_fun(n, distribution)
  }

  xbase = seq(-4.0, 4.0, length.out = 100)
  hist(ZZ, breaks = 100, freq = FALSE,
       xlim = c(min(xbase), max(xbase)),
       main = paste0("hist with sample size ", n))
  lines(x = xbase, y = dnorm(xbase), col = "red")
  return(ZZ)
}

par(mfrow = c(3, 1))
phist = CLT_plot(2, "chisq2")
phist = CLT_plot(10, "chisq2")
phist = CLT_plot(100, "chisq2")
```
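
A numerical companion to the histograms is to compare one simulated probability with its normal counterpart. The sketch below is self-contained and re-implements the standardized $\chi^{2}(2)$ sample mean under the hypothetical name `z_chisq2`. The cutoff 1.645 and the 10000 replications are assumptions chosen for illustration.

```r
set.seed(1)
# standardized mean of n chi-square(2) draws (mean 2, variance 4)
z_chisq2 = function(n) {
  sqrt(n) * (mean(rchisq(n, df = 2)) - 2) / 2
}

for (n in c(2, 10, 100)) {
  zz = replicate(10000, z_chisq2(n))
  cat("n =", n,
      " simulated P(Z <= 1.645) ~", mean(zz <= 1.645),
      " normal value:", pnorm(1.645), "\n")
}
```

As $n$ grows, the simulated probability should move toward the normal value, reflecting the fading skewness of the $\chi^{2}(2)$ distribution in the standardized mean.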


## Tools for Transformations

-   Continuous mapping theorem 1: If $x_{n}\stackrel{p}{\to}a$ and
    $f\left(\cdot\right)$ is continuous at $a$, then
    $f\left(x_{n}\right)\stackrel{p}{\to}f\left(a\right)$.

-   Continuous mapping theorem 2: If $x_{n}\stackrel{d}{\to}x$ and
    $f\left(\cdot\right)$ is continuous almost surely on the support of
    $x$, then $f\left(x_{n}\right)\stackrel{d}{\to}f\left(x\right)$.

-   Slutsky’s theorem: If $x_{n}\stackrel{d}{\to}x$ and
    $y_{n}\stackrel{p}{\to}a$, then

    -   $x_{n}+y_{n}\stackrel{d}{\to}x+a$

    -   $x_{n}y_{n}\stackrel{d}{\to}ax$

    -   $x_{n}/y_{n}\stackrel{d}{\to}x/a$ if $a\neq0$.

Slutsky’s theorem collects special cases of continuous mapping
theorem 2, applied to the joint convergence of $\left(x_{n},y_{n}\right)$
to $\left(x,a\right)$. We list it as a separate theorem only because
addition, multiplication and division are encountered so frequently in practice.
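
A common use of Slutsky’s theorem is the t-statistic. If $\sqrt{n}\left(\overline{x}-\mu\right)\stackrel{d}{\to}N\left(0,\sigma^{2}\right)$ by the CLT and the sample standard deviation $s\stackrel{p}{\to}\sigma$ by the LLN and continuous mapping theorem 1, then $\sqrt{n}\left(\overline{x}-\mu\right)/s\stackrel{d}{\to}N\left(0,1\right)$. The sketch below checks this by simulation with exponential data, which has mean 1 and variance 1. The function name `t_stat`, the sample size 500, and the number of replications are assumptions for this illustration.

```r
set.seed(1)
# t-statistic centered at the true mean, with exponential(1) data
t_stat = function(n) {
  x = rexp(n, rate = 1)               # mean 1, variance 1
  sqrt(n) * (mean(x) - 1) / sd(x)     # sd(x) estimates sigma = 1
}

tt = replicate(10000, t_stat(500))
print(c(mean = mean(tt), var = var(tt),
        reject = mean(abs(tt) > 1.96)))  # share beyond the 5% normal cutoffs
```

If the normal approximation is accurate, the variance should be near 1 and the share beyond $\pm1.96$ near 0.05.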
