# Statistics


In [None]:
import matplotlib.pyplot as plt
import numpy as np
import scipy.stats as st

from ipywidgets import interact, FloatSlider, IntSlider

%matplotlib inline

### What is a random variable?

A random variable is a variable that takes a stochastic value from a set of possible outcomes.

* Let $\mathcal{Y}$ denote the set of all possible outcomes
* Let $f_Y$ be the probability distribution, then for any subset $A \subseteq \mathcal{Y}$, $f_Y(A) \equiv \text{Probability}(A) = \int_{x \in A} f_Y(x) dx$
* We will use $Y \in \mathcal{Y}$ to denote a vector of random variables (potentially of length 1)
* We will use $y$ to denote a particular realization of the random variable $Y$


A sequence of i.i.d. random variables is itself a random variable.

Let $\{Y_i\}$ be a sequence of independent and identically distributed random variables, then $\forall i$ $Y_i \sim f_Y$.

**Example**:

If $Y$ is a Normal random variable with mean $\mu$ and standard deviation $\sigma$, then the set of outcomes is $\mathbb{R}$ and the probability density function of the probability distribution is given by

$$f(x) = \frac{1}{\sigma \sqrt{2 \pi}} \exp \left(-\frac{(x - \mu)^2}{2 \sigma^2}\right)$$
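
As a quick sanity check, the density formula can be coded by hand and compared with `scipy.stats.norm.pdf`. This is a minimal sketch: the helper name `normal_pdf` and the parameter values are ours, chosen only for illustration.

```python
import numpy as np
import scipy.stats as st


def normal_pdf(x, mu, sigma):
    """Evaluate the Normal density formula written out by hand."""
    coef = 1.0 / (sigma * np.sqrt(2.0 * np.pi))
    return coef * np.exp(-((x - mu) ** 2) / (2.0 * sigma**2))


x = np.linspace(-5.0, 5.0, 11)
by_hand = normal_pdf(x, mu=0.5, sigma=2.0)
by_scipy = st.norm(loc=0.5, scale=2.0).pdf(x)

# The two evaluations should agree up to floating point error
max_abs_diff = np.max(np.abs(by_hand - by_scipy))
print(max_abs_diff)
```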

In [None]:
mu = 0.0
sigma = 1.0

Y = st.norm(mu, sigma)

y = Y.rvs(size=5)


In [None]:
print(Y)  # A random variable


In [None]:
Y.support()  # The possible outcomes


In [None]:
fig, ax = plt.subplots()

lspace = np.linspace(-5.0, 5.0, 100)
ax.plot(lspace, Y.pdf(lspace))  # Probability density function


In [None]:
print(y)  # Observations


### A function of a random variable is a random variable

Let $X$ be a random variable such that $X \equiv g(Y)$ where $g$ is an invertible, strictly increasing function.

Let $z$ be a constant. Then $X$ is a random variable with cumulative distribution function given by:

$$F_X(z) = \text{Probability}(X \leq z) = \text{Probability}(g(Y) \leq z) = F_Y(g^{-1}(z))$$

**Example**

Let $Y$ be a normally distributed random variable with mean 0 and standard deviation 1.

If $X \equiv \exp(Y)$, what is the probability distribution associated with $X$?

\begin{align*}
  \text{Prob}(X \leq z) &= \text{Prob}(\exp(Y) \leq z) \\
  &= \text{Prob}(Y \leq \log(z)) \\
  &= \Phi(\log(z))
\end{align*}

for $z > 0$, where $\Phi$ is the cumulative distribution function of the standard Normal. One can show that this is the cumulative distribution function of the log-normal distribution.
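
The claim can also be checked numerically without any sampling. The sketch below evaluates the Normal cdf at $\log(z)$ and compares it with SciPy's log-normal cdf; the evaluation points in `z` are arbitrary values picked for illustration.

```python
import numpy as np
import scipy.stats as st

z = np.array([0.1, 0.5, 1.0, 2.0, 10.0])

# F_X(z) = F_Y(log z) when X = exp(Y) and Y ~ N(0, 1)
cdf_from_transform = st.norm(0.0, 1.0).cdf(np.log(z))

# SciPy's log-normal with s = sigma = 1 and scale = exp(mu) = 1
cdf_lognormal = st.lognorm(s=1.0, scale=1.0).cdf(z)

max_abs_diff = np.max(np.abs(cdf_from_transform - cdf_lognormal))
print(max_abs_diff)
```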


In [None]:
mu = 0.0
sigma = 1.0

Y = st.norm(loc=mu, scale=sigma)
Z = st.lognorm(scale=np.exp(mu), s=sigma)  # SciPy's lognorm takes s=sigma (shape) and scale=exp(mu)

x = np.exp(Y.rvs(100_000))
z = Z.rvs(100_000)


In [None]:
fig, ax = plt.subplots()

ax.hist(x, bins=50, alpha=0.25, color="red", density=True)
ax.hist(z, bins=50, alpha=0.25, color="black", density=True)

ax.set_xlim(0.0, 25.0);

### What is a statistic?

Let $Y$ be a random variable. Let $\tilde{y}$ be an i.i.d. sample of length $n$ from $Y$.

A statistic is a function, $g$, that maps samples from $Y$, denoted by $\tilde{y}$, into a scalar or vector.



This means that a statistic is itself a random variable, because it is a function of random variables.
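
To see this randomness concretely, one can draw many independent samples and apply the same statistic to each. The sketch below uses the sample mean of $N(0, 1)$ draws as the statistic; the seed, the sample length, and the number of samples are arbitrary choices of ours.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily

n = 100            # length of each sample
n_samples = 5_000  # number of independent samples

# Each row is one i.i.d. sample of length n from N(0, 1)
samples = rng.standard_normal((n_samples, n))

# Apply the statistic (here the sample mean) to every sample
stat_values = samples.mean(axis=1)

# The statistic varies from sample to sample; for the mean of n
# standard normals its standard deviation should be near 1/sqrt(n)
print(stat_values.std(), 1 / np.sqrt(n))
```

A histogram of `stat_values` would show the distribution of the statistic across samples.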

**Example: Sample mean**

A sample mean is a statistic.

$$\bar{y} \equiv \frac{1}{n} \sum_i \tilde{y}_i$$

In [None]:
Y = st.norm(0.0, 1.0)

ytilde = Y.rvs(10_000)

np.mean(ytilde)


**Example: Counting**

Let $Y$ be a discrete random variable with $m$ possible values denoted $\{\bar{y}_i\}_{i=1}^m$. Counting the number of occurrences of each of the $m$ possible values is a statistic.

$$\{c_{i}\}_{i=1}^m \equiv \left\{ \sum_{j=1}^n \mathbb{1}_{\{\tilde{y}_j = \bar{y}_i\}}\right\}_{i=1}^m$$

In [None]:
# Random variable that takes integers from lb to ub-1
# with equal probability
Y = st.randint(0, 5)

ytilde = Y.rvs(5_000)

np.array([np.sum(ytilde == i) for i in range(0, 5)])

## Models

### What is a model?

A model is a probability distribution for $Y$ indexed by a vector of parameters, $\theta$.

We write this as $f(Y | \theta)$ or $f_{\theta}(Y)$.

**Example**

Suppose that quarterly GDP growth can be described by

$$y = \theta + \varepsilon$$

where $\theta = 1.5$ and $\varepsilon \sim N(0, 1)$

In [None]:
def plot_model_distribution(theta):
    fig, ax = plt.subplots()

    # Grid of y values at which to evaluate the density
    yvals = np.linspace(-10.0, 10.0, 200)

    # Distribution of the shock, epsilon ~ N(0, 1)
    f_epsilon = st.norm(0.0, 1.0)

    # Density of y = theta + epsilon evaluated on the grid
    prob = f_epsilon.pdf(yvals - theta)

    ax.plot(yvals, prob)

    ax.spines["right"].set_visible(False)
    ax.spines["top"].set_visible(False)
    ax.set_xlabel(r"$y$")
    ax.set_ylabel(r"$f(y)$")
    ax.set_title("A single model");

    return fig, ax

In [None]:
plot_model_distribution(1.5);  # theta = 1.5, as in the example above


### Manifolds of models

A manifold of statistical models consists of a set of joint probability distributions, $f(Y | \theta)$, "swept out" by $\theta \in \Theta$.

Once we pick a $\theta$, we have a model and can simulate or do anything else that we might want to do with a model.

(_spoiler alert: we'll talk about "how to pick a $\theta$" soon_)

**Example**

A manifold of models for quarterly GDP growth could be described by

$$y = \theta + \varepsilon$$

for $\theta \in \Theta$ and where $\varepsilon \sim N(0, 1)$.

In [None]:
def plot_manifolds(theta=1.5):

    # Values for plotting
    thetas = np.linspace(-3.5, 3.5, 50)
    yvals = np.linspace(-10.0, 10.0, 500)
    tt, yy = np.meshgrid(thetas, yvals)
    mod_probs = st.norm(0.0, 1.0).pdf(yvals - theta)
    mf_probs = st.norm(0.0, 1.0).pdf(yy - tt)

    # Create figure
    fig = plt.figure(figsize=(10, 8))
    fig.suptitle("Manifold of Models")

    ax1 = fig.add_subplot(projection="3d")

    ax1.plot_surface(
        tt, yy, mf_probs,
        cmap="viridis", linewidth=0.0, alpha=0.5,
        cstride=2, rstride=2
    )
    ax1.plot(yvals, mod_probs, zs=-5.0, zdir="x")
    ax1.plot(
        yvals, mod_probs, zs=theta, zdir="x",
        color="black", linewidth=2.0
    )

    ax1.set_xlabel(r"$\theta$", size=10)
    ax1.set_ylabel(r"$y$", size=10)
    ax1.set_zlabel(r"$f(y)$", size=10)
    ax1.set_xlim(-5.0, 4.0)
    ax1.set_ylim(-12.0, 12.0)
    ax1.view_init(25, -20)

    return None


In [None]:
theta_slider = FloatSlider(
    value=1.5, min=-3.5, max=3.5, step=0.5,
    description="Value of theta"
)

interact(plot_manifolds, theta=theta_slider);

### Likelihood function

Imagine you have a manifold of models indexed by $\theta$ and someone asks you how likely a particular group of observations is to have come from different elements of your manifold.

How could you answer this?

The **likelihood function** for a manifold of statistical models is a function that maps values of the parameters, $\theta \in \Theta$, and observations, $\tilde{y} \in \mathcal{Y}$, into the probability (or the probability density, when $Y$ is continuous) of observing $\tilde{y}$ given $\theta$.

$$\mathcal{L}(\theta | \tilde{y}) \equiv f(\tilde{y} | \theta)$$

**Example**

Let's continue to build on our quarterly GDP growth example. The manifold of models is given by:

$$y = \theta + \varepsilon$$

for $\theta \in \Theta$ and where $\varepsilon \sim N(0, 1)$.

Let $\tilde{y}$ be $n$ i.i.d. observations generated by a model from the manifold of models above.

What's the likelihood that this sequence was generated by a particular model indexed by $\hat{\theta} \in \Theta$?

\begin{align*}
  \mathcal{L}(\hat{\theta} | \tilde{y}) &= f(\tilde{y} | \hat{\theta}) \\
  &= f(\tilde{y}_1 | \hat{\theta}) f(\tilde{y}_2 | \hat{\theta}) \dots f(\tilde{y}_n | \hat{\theta}) \\
  &= \prod_{i=1}^n \phi(\tilde{y}_i - \hat{\theta})
\end{align*}

where $\phi$ denotes the standard Normal probability density function.
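
In practice the product of many densities is rarely computed directly, because it underflows quickly. The sketch below illustrates this with a simulated sample (the seed and the sample size of 2,000 are our own choices): every standard Normal density value is below 0.4, so the raw product collapses to zero in floating point while the sum of log densities stays finite.

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(1)
theta_hat = 0.0
ytilde_sim = theta_hat + rng.standard_normal(2_000)

densities = st.norm(0.0, 1.0).pdf(ytilde_sim - theta_hat)

# Raw likelihood: a product of 2,000 numbers below 0.4 underflows to 0.0
raw_likelihood = np.prod(densities)

# Log likelihood: the same information expressed as a finite sum
log_likelihood_value = np.sum(st.norm(0.0, 1.0).logpdf(ytilde_sim - theta_hat))

print(raw_likelihood, log_likelihood_value)
```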

In [None]:
def log_likelihood(theta_hat, ytilde):
    """
    Computes the log likelihood of observing a particular
    group of observations
    
    Parameters
    ----------
    theta_hat : float
        The model parameter
    ytilde : np.array(float, ndim=1)
        i.i.d. sequences of observations
    
    Returns
    -------
    log_ll : float
        The log of the likelihood function -- This is
        more numerically stable than the raw likelihood
    """
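    # Compute the log density of each individual observation directly;
    # logpdf avoids underflow from taking the log of a tiny pdf value
    log_ls = st.norm(0.0, 1.0).logpdf(ytilde - theta_hat)

    # Sum the logs rather than take the product of the densities
    log_ll = np.sum(log_ls)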
    
    return log_ll

In [None]:
log_likelihood(0.0, np.random.randn(10))

Below we take observations of quarterly GDP growth from the United States between the years of 1990 and 2015.

In [None]:
# US quarterly GDP growth from 1990 to 2015
ytilde = np.array([
    2.18, 1.49, 0.92, -0.17, 0.51, 1.52, 1.29, 0.94, 1.57, 1.69, 1.48, 1.74, 0.73, 1.18, 1.07, 1.91,
    1.45, 1.84, 1.16, 1.69, 0.9, 0.78, 1.35, 1.16, 1.23, 2.09, 1.23, 1.58, 1.25, 1.87, 1.69, 1.19, 1.15,
    1.16, 1.69, 1.9, 1.33, 1.14, 1.66, 2.25, 1.05, 2.45, 0.7, 1.16, 0.32, 1.19, -0.01, 0.6, 1.21, 0.97,
    0.91, 0.72, 1.01, 1.16, 2.25, 1.75, 1.28, 1.58, 1.61, 1.78, 1.91, 1.17, 1.8, 1.44, 2.04, 1.07, 0.86,
    1.22, 1.22, 1.22, 1.06, 1.01, -0.21, 1.06, 0.2, -1.86, -1.13, -0.29, 0.47, 1.44, 0.64, 1.39, 1.03, 1.07,
    0.3, 1.38, 0.62, 1.31, 1.41, 0.83, 0.65, 0.63, 1.29, 0.41, 1.27, 1.39, 0.13, 1.92, 1.66, 0.72, 0.86,
    1.22, 0.68, 0.17
])

for _th in [0.0, 1.0, 2.0]:
    print(f"Log likelihood of being generated by model indexed by theta={_th} is")
    print("\t", log_likelihood(_th, ytilde))

In [None]:
def plot_likelihood(ytilde):
    # Values for plotting
    thetas = np.linspace(-3.5, 3.5, 50)
    log_ll = np.array([log_likelihood(_th, ytilde) for _th in thetas])

    # Create figure
    fig, ax = plt.subplots(figsize=(8, 6))
    fig.suptitle("Likelihood function", size=16)

    ax.plot(thetas, log_ll)

    ax.set_xlabel(r"$\theta$", size=12)
    ax.set_ylabel(r"$\log \mathcal{L}(\theta | \tilde{y})$", size=12)

    ax.spines["right"].set_visible(False)
    ax.spines["top"].set_visible(False)

    return fig, ax

plot_likelihood(ytilde);

## Direct and Inverse Problem

### Direct problem

The _direct problem_ is to draw a sample, $\tilde{y}$, from a given model. We also call this "simulating" or "sampling from the model".

**Input to the direct problem**: $\theta$

**Output of the direct problem**: $\tilde{y}$

Once you have samples from the model, $\tilde{y}$, the samples can be used to generate statistics associated with the model.
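
For the running example $y = \theta + \varepsilon$ with $\varepsilon \sim N(0, 1)$, the direct problem might be sketched as follows. The function name `simulate_model`, the seed, and the sample size are hypothetical choices for illustration.

```python
import numpy as np


def simulate_model(theta, n, rng):
    """Draw n i.i.d. observations from y = theta + epsilon, epsilon ~ N(0, 1)."""
    return theta + rng.standard_normal(n)


rng = np.random.default_rng(2)
ytilde_sim = simulate_model(theta=1.5, n=10_000, rng=rng)

# Statistics computed from the simulated sample
sim_mean = ytilde_sim.mean()
sim_std = ytilde_sim.std(ddof=1)
print(sim_mean, sim_std)
```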

### Inverse problem

The inverse problem takes a given manifold of models and a sample generated from one of these models, $\tilde{y}$, and infers which model from the manifold generated the data. This is also sometimes called "statistical inference".

**Input to the inverse problem**: $\tilde{y}$, $\{f(\tilde{y} | \theta) \; \forall \theta \in \Theta\}$

**Output of the inverse problem**: $\hat{\theta}$

Once you have chosen a $\hat{\theta} \in \Theta$, you have a model that you could use in the direct problem.
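
One simple (if crude) way to approach the inverse problem for the same example is to evaluate the log likelihood on a grid of candidate values of $\theta$ and keep the best one. This is only a sketch under our own assumptions: the grid, the seed, and the simulated data are illustrative, and a grid search is just one of many possible inference procedures.

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(3)
theta_true = 1.5  # known here only because we simulate the data ourselves
ytilde_sim = theta_true + rng.standard_normal(500)

# Candidate models from the manifold
theta_grid = np.linspace(-3.5, 3.5, 701)

# Log likelihood of the sample under each candidate theta
log_lls = np.array([
    st.norm(th, 1.0).logpdf(ytilde_sim).sum() for th in theta_grid
])

# The candidate under which the sample is most likely
theta_hat = theta_grid[np.argmax(log_lls)]
print(theta_hat, ytilde_sim.mean())
```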

## Estimators

Let $\tilde{y}$ be a sample of $n$ i.i.d observations generated by a model from a particular manifold of models.

An estimator is a function that maps a sample, $\tilde{y}$, into a model from the manifold of models. We will use $\hat{\theta}_n$ (or just $\hat{\theta}$) to denote the estimator of $\theta$ for a given manifold of models.

Estimators of parameters are statistics, but statistics are not necessarily estimators.

**Advisory!**

The parameter $\theta$ of a model is fixed. It is **not** a random variable.

However, $\hat{\theta}_n$ is a statistic which means it **is** a random variable, which means that it has a sampling distribution.


### Characteristics of estimators

**Consistent**

Given $\tilde{y}$ (of length $n$), an estimator, $\hat{\theta}_n$, of a parameter, $\theta$, is consistent if

$$\hat{\theta}_n \overset{p}{\to} \theta$$

This definition should remind you of the law of large numbers! The law of large numbers is going to play a central role in checking the consistency of an estimator.

**Example 1**

Consider our manifold of models given by

$$y = \theta + \varepsilon$$

for $\theta \in \Theta$ and where $\varepsilon \sim N(0, 1)$.

Let $\tilde{y}$ be a sample of $n$ i.i.d. observations. Let an estimator of $\theta$, $g^{\dagger}$, be defined by:

$$g^{\dagger}(\tilde{y}) = \frac{1}{n} \sum_i \tilde{y}_i$$

Is $\theta^{\dagger} \equiv g^{\dagger}(\tilde{y})$ consistent?

The estimator is defined by:

\begin{align*}
  g^{\dagger}(\tilde{y}) &= \frac{1}{n} \sum_i \tilde{y}_i \\
  &= \frac{1}{n} \sum_i (\theta + \varepsilon_i) \\
  &= \theta + \frac{1}{n} \sum_i \varepsilon_i
\end{align*}

The LLN tells us $\frac{1}{n} \sum_i \varepsilon_i \overset{p}{\to} 0$ which means $g^{\dagger}(\tilde{y}) \overset{p}{\to} \theta$.
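
The convergence can be illustrated by simulation. The sketch below estimates the root mean squared error of the sample mean for several sample sizes; the true value $\theta = 1.5$, the seed, and the number of replications are assumptions made for the illustration.

```python
import numpy as np

rng = np.random.default_rng(4)
theta = 1.5
n_reps = 200  # independent samples drawn for each sample size

sample_sizes = [10, 100, 1_000, 10_000]
rmse = []
for n in sample_sizes:
    # Each row is one sample of length n from y = theta + epsilon
    draws = theta + rng.standard_normal((n_reps, n))
    theta_dagger = draws.mean(axis=1)
    rmse.append(np.sqrt(np.mean((theta_dagger - theta) ** 2)))

for n, err in zip(sample_sizes, rmse):
    print(f"n = {n:>6}: RMSE of the sample mean = {err:.4f}")
```

The error should shrink roughly like $1 / \sqrt{n}$ as the sample grows.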

... This case was a little easy. What happens when we have more complicated models?

## Maximum Likelihood

We now formally introduce a class of estimators we hinted at earlier known as the "maximum likelihood" estimators.

These estimators are defined by finding $\hat{\theta}$ such that $\mathcal{L}(\hat{\theta} | \tilde{y}) = \max_{\theta} f(\tilde{y} | \theta)$, that is, the parameter value under which the observed sample is most likely.

### Properties of maximum likelihood

As long as certain regularity conditions are satisfied, the maximum likelihood estimator is

* Consistent
* Asymptotically normal: the sampling distribution of $\hat{\theta}$ is approximately normal in large samples
* Efficient
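
These properties can be glimpsed in a small simulation of the model $y = \theta + \varepsilon$, $\varepsilon \sim N(0, 1)$, for which the maximum likelihood estimate turns out to be the sample mean (this is derived in the example that follows). The sketch is only illustrative: the seed, sample size, and number of replications are our choices, and the sample median is included as a competing estimator purely for comparison.

```python
import numpy as np

rng = np.random.default_rng(5)
theta, n, n_reps = 1.5, 200, 5_000

# Each row is one sample of length n
draws = theta + rng.standard_normal((n_reps, n))

# In this model the maximum likelihood estimate is the sample mean
theta_mle = draws.mean(axis=1)
# A competing consistent estimator of theta: the sample median
theta_median = np.median(draws, axis=1)

# Standardized MLE errors should look like N(0, 1) draws
z = np.sqrt(n) * (theta_mle - theta)

var_mle = theta_mle.var()
var_median = theta_median.var()
print(z.mean(), z.std(), var_median / var_mle)
```

The variance ratio printed at the end should be above one, reflecting that the median is a noisier estimator of $\theta$ than the maximum likelihood estimate in this model.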

### Computing the maximum likelihood estimate

**Example**

Consider our quarterly GDP growth example. The manifold of models is given by:

$$y = \theta + \varepsilon$$

for $\theta \in \Theta$ and where $\varepsilon \sim N(0, 1)$.


Recall that our likelihood function, viewed as a function of $\theta$, was given by (with $\sigma = 1$ in our model)

\begin{align*}
  \mathcal{L}(\theta | \tilde{y}) &= \prod_{i=1}^n \phi(\tilde{y}_i - \theta) \\
  &= \prod_{i=1}^n \frac{1}{\sigma \sqrt{2 \pi}} \exp \left(-\frac{(\tilde{y}_i - \theta)^2}{2 \sigma^2}\right) \\
  &= \left(\frac{1}{\sigma \sqrt{2 \pi}}\right)^n \exp \left(-\sum_i \frac{(\tilde{y}_i - \theta)^2}{2 \sigma^2}\right)
\end{align*}

We can maximize by taking the derivative with respect to $\theta$ and setting it equal to 0,

\begin{align*}
  \frac{\partial \mathcal{L}(\theta | \tilde{y})}{\partial \theta} &= \left(\frac{1}{\sigma \sqrt{2 \pi}}\right)^n \exp \left(-\sum_i \frac{(\tilde{y}_i - \theta)^2}{2 \sigma^2}\right) \left(\sum_i \frac{\tilde{y}_i - \theta}{ \sigma^2}\right) = 0
\end{align*}

In order to set this to 0, we need the product of the 3 pieces to be 0. The first two pieces are strictly positive, so only the last component can be equal to 0:

\begin{align*}
  0 &= \sum_i \frac{\tilde{y}_i - \theta}{ \sigma^2} \\
  &\Rightarrow \hat{\theta} = \frac{1}{n} \sum_i \tilde{y}_i
\end{align*}
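
An alternative route to the same solution goes through the log likelihood. Because the logarithm is strictly increasing, the likelihood and the log likelihood share the same maximizer, and the algebra is shorter:

\begin{align*}
  \log \mathcal{L}(\theta | \tilde{y}) &= -n \log(\sigma \sqrt{2 \pi}) - \sum_i \frac{(\tilde{y}_i - \theta)^2}{2 \sigma^2} \\
  \frac{\partial \log \mathcal{L}(\theta | \tilde{y})}{\partial \theta} &= \sum_i \frac{\tilde{y}_i - \theta}{\sigma^2} \\
  \frac{\partial^2 \log \mathcal{L}(\theta | \tilde{y})}{\partial \theta^2} &= -\frac{n}{\sigma^2} < 0
\end{align*}

Setting the first derivative to 0 again gives $\theta = \frac{1}{n} \sum_i \tilde{y}_i$, and the negative second derivative confirms that this point is a maximum rather than a minimum.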

Notice that this is exactly the sample-mean estimator we studied in the previous section.

We already have a function that can evaluate the log-likelihood, so let's check our answer numerically!

In [None]:
import scipy.optimize as opt

f = lambda x: -log_likelihood(x, ytilde)

sol = opt.minimize(f, x0=np.array([2.0]))

print(f"Optimizer says: {sol.x[0]}")
print(f"Sample mean is: {np.mean(ytilde)}")

### Identification

Identification has a precise definition.

A parameter vector, $\theta \in \Theta$, is said to be _identified_ by observations $\tilde{y} \in \mathcal{Y}$ if the maximum likelihood estimate of $\theta$ is unique.


In [None]:
plot_likelihood(ytilde);

When might a model not be identified?

A model is not identified when the likelihood function has no unique maximum. This can arise because

1. The parameters that index the models in your manifold are hard (impossible) to learn about
  - Consider parameters $\beta_1, \beta_2$ that index a manifold of models described by $y = (\beta_1 + \beta_2) x + \varepsilon$.
  - It is impossible to separate $\beta_1$ from $\beta_2$ in that model no matter how many values of $y$ you are given

In [None]:
fig, ax = plt.subplots()

beta = np.linspace(-2.0, 2.0, 200)
beta_1, beta_2 = np.meshgrid(beta, beta)
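# Stylized likelihood for y = (beta_1 + beta_2) x + eps: it depends on the
# parameters only through their sum, so it is flat along beta_1 + beta_2 = const
foo = st.norm(0, 0.25).pdf(beta_1 + beta_2)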

ax.set_xlabel(r"$\beta_1$")
ax.set_ylabel(r"$\beta_2$")
ax.pcolor(
    beta_1, beta_2, foo,
    shading="nearest"
);

2. The data provides insufficient information
  - Consider parameters $\beta_1$ and $\beta_2$ that index a manifold of models described by $y = \beta_1 x_1 + \beta_2 x_2 + \varepsilon$ with $\tilde{y} = \begin{bmatrix} 0.5 \end{bmatrix}$, $x_1 = \begin{bmatrix} 1.0 \end{bmatrix}$, $x_2 = \begin{bmatrix} 1.5 \end{bmatrix}$
  - Will only be able to determine that $0.5 = \beta_1 + 1.5 \beta_2$
  - This is similar to previous example in the sense that it will create a parameter ridge
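
The parameter ridge in the first case can be verified directly: any two parameter pairs with the same sum produce exactly the same likelihood. The sketch below assumes $\varepsilon \sim N(0, 1)$ and simulated regressors, neither of which is specified in the text, and the helper name `log_likelihood_ridge` is ours.

```python
import numpy as np
import scipy.stats as st

rng = np.random.default_rng(6)
x = rng.standard_normal(200)
y = 1.0 * x + rng.standard_normal(200)  # true beta_1 + beta_2 = 1.0


def log_likelihood_ridge(beta_1, beta_2, x, y):
    """Log likelihood for y = (beta_1 + beta_2) x + eps, eps ~ N(0, 1)."""
    resid = y - (beta_1 + beta_2) * x
    return st.norm(0.0, 1.0).logpdf(resid).sum()


# Three different parameter pairs with the same sum
ll_a = log_likelihood_ridge(0.5, 0.5, x, y)
ll_b = log_likelihood_ridge(1.0, 0.0, x, y)
ll_c = log_likelihood_ridge(-4.0, 5.0, x, y)

# A pair with a different sum, for contrast
ll_d = log_likelihood_ridge(1.0, 1.0, x, y)

print(ll_a, ll_b, ll_c, ll_d)
```

Since the first three values coincide, no amount of data can single out one of those pairs as the maximizer.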

## Another Example:

For the sake of variety, we now move on to a slightly different class of model than we have been using.

Consider a class of 50 students at a university.

Let each student's "innate" ability be observable and given by a random variable, $A \sim N(0, 1)$.

Each student passes their classes with a probability

$$p_i = \frac{1}{1 + \gamma_1 e^{-\gamma_2 a_i}}$$

where $a_i$ is their ability level.

In [None]:
def compute_p(g1, g2, a):
    return 1 / (1 + g1*np.exp(-g2*a))

def plot_p(g1, g2):
    a = np.linspace(-3.0, 3.0, 50)
    p = compute_p(g1, g2, a)
    
    fig, ax = plt.subplots()
    
    ax.plot(a, p)
    
    return None
    
g1_slider = FloatSlider(
    value=0.5, min=0.00, max=3.5, step=0.1,
    description="Value of g1"
)
g2_slider = FloatSlider(
    value=0.3, min=0.00, max=3.5, step=0.1,
    description="Value of g2"
)

interact(plot_p, g1=g1_slider, g2=g2_slider);

Suppose that $\gamma_1 = 0.3$, and $\gamma_2 = 1.5$

**Direct problem**

In [None]:
# Parameters
g1 = 0.3
g2 = 1.5
n = 50

# Generate abilities, passing probabilities
a_i = np.random.randn(n)
p_i = compute_p(g1, g2, a_i)

# Use passing probabilities to determine whether students passed
passed = np.random.rand(n) < p_i

_Statistics of interest_


In [None]:
print(f"Fraction that passed is: {np.mean(passed)}")

In [None]:
print(f"Correlation between ability and pass is: {np.corrcoef(a_i, passed)[0, 1]}")

**Inverse problem**

Given a manifold of models and a sample of data, can we infer which model generated our data?

**Likelihood**

\begin{align*}
  \mathcal{L}(\theta | \tilde{y}) &= f(\tilde{y} | \theta) \\
  &= \prod_i f(\tilde{y}_i | \theta) \\
  &= \prod_i \left[ p_i \mathbb{1}_{\text{pass}_i} + (1 - p_i) (1 - \mathbb{1}_{\text{pass}_i}) \right] \\
  &= \prod_i \left[ \frac{1}{1 + \gamma_1 e^{-\gamma_2 a_i}} \mathbb{1}_{\text{pass}_i} + \left(1 - \frac{1}{1 + \gamma_1 e^{-\gamma_2 a_i}}\right) (1 - \mathbb{1}_{\text{pass}_i}) \right]
\end{align*}

It's possible that there's a way to maximize this analytically, but we have great numerical tools at our disposal so we're going to pass.
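
One observation that may help with the numerical work: when $\gamma_1 > 0$, writing $\gamma_1 = e^{-b_0}$ turns $p_i$ into the logistic function of $b_0 + \gamma_2 a_i$, and the log likelihood can then be evaluated without ever forming $p_i$ or $1 - p_i$ explicitly. The sketch below is an optional variant under that assumption; the names `b0` and `loglikelihood_logit`, the seed, and the simulated data are ours.

```python
import numpy as np


def compute_p(g1, g2, a):
    # Same passing probability formula as in the text
    return 1 / (1 + g1 * np.exp(-g2 * a))


def loglikelihood_logit(b0, g2, a, passed):
    """Log likelihood with g1 = exp(-b0), written with logaddexp for stability."""
    eta = b0 + g2 * a
    log_p = -np.logaddexp(0.0, -eta)    # log(p_i)
    log_1mp = -np.logaddexp(0.0, eta)   # log(1 - p_i)
    return np.sum(np.where(passed, log_p, log_1mp))


rng = np.random.default_rng(7)
g1, g2 = 0.3, 1.5
a = rng.standard_normal(50)
passed = rng.random(50) < compute_p(g1, g2, a)

# Direct evaluation from the probabilities, for comparison
p = compute_p(g1, g2, a)
direct = np.sum(np.log(np.where(passed, p, 1 - p)))
stable = loglikelihood_logit(-np.log(g1), g2, a, passed)

print(direct, stable)
```

Optimizing over `b0` instead of `g1` also keeps $\gamma_1$ positive automatically, which avoids taking logs of negative numbers during the search.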

In [None]:
def loglikelihood_uni(theta, a_i, ytilde):
    # Unpack params and compute p_is
    g1, g2 = theta
    p_i = compute_p(g1, g2, a_i)

    # Likelihood of each observation: p_i if the student passed, 1 - p_i otherwise
    lls = np.where(ytilde, p_i, 1 - p_i)

    return np.sum(np.log(lls))


def neg_loglikelihood_uni(theta, a_i, ytilde):
    return -loglikelihood_uni(theta, a_i, ytilde)

In [None]:
loglikelihood_uni(np.array([0.2, 1.2]), a_i, passed)

**Moment of truth**

In [None]:
sol = opt.minimize(
    neg_loglikelihood_uni, np.array([1.75, 1.75]),
    args=(a_i, passed), method="nelder-mead"
)

In [None]:
print(f"Parameters that generated the data were: {g1, g2}")
print(f"Maximum likelihood parameters were: {sol.x}")

### When to use maximum likelihood

If you have a model or manifold of models, always write down the likelihood function. Fisher showed that the maximum likelihood estimator has great asymptotic properties and is well behaved (everything a frequentist would want from an estimator!).

If you can write it down...

There are some models (economic models especially) where you can't write down the likelihood function. However, you can often still simulate the model.

If we can't write down a likelihood (but we can simulate it) then we are forced to turn to other methods which we will discuss next time.