# Moment generating functions tutorial

- Written by Michael Shadlen, 2003 (Revised 2004)
- Translated to Python by Michael Waskom, 2020

This tutorial introduces the moment generating function. It complements (and repeats to some extent) the section in the Mathematica tutorial on Wald's identity (Wald_identity.nb).

The goal is to develop a few basic intuitions that are essential for understanding Wald's identity and its connection to the psychometric function. Here are the topics you need to understand:

- The basic definition of the MGF
- The use of the MGF to calculate moments
- The MGF for a random variable, $Z$, that is the sum of two independent random variables, $X + Y$: the product of the MGFs associated with these two random variables.

One of the more important goals is to gain an intuition for how changes in the properties of a random variable, its mean and variance, affect a special point on the moment generating function where it returns to 1: the special root, $\theta_1$. The reason we care about this is spelled out in the Mathematica tutorial. It is the connection between Wald's identity and the psychometric function.

This short tutorial accomplishes two things.
- In part 1, we define the moment generating function (mgf) and use it to compute moments (hence the name). 
- In the second part, we look for the existence of a special root of the MGF. This is the nonzero solution of the equation $M(\theta) = 1$.

In [None]:
import numpy as np
from scipy import stats, integrate
import matplotlib.pyplot as plt

## Part 1. The Moment Generating Function: Definition and properties

The MGF is a transform of a probability function or probability density function. The PDF, $f(x)$, is the likelihood (or probability density) of observing a random value, $x$. The MGF, $M(\theta)$, is a function of a new variable, $\theta$. Think of it the same way you would think of a Fourier or Laplace transform.

Let's begin with an example of something familiar, the normal distribution.

In [None]:
dx = .01
x = np.arange(-10, 10 + dx, dx)

# Mean and standard deviation
u, s = 1, 1.2
pdf = stats.norm(u, s).pdf(x)

# Plot the probability density function (PDF)
f, ax = plt.subplots()
ax.plot(x, pdf)
ax.set(
    xlabel="$x$",
    ylabel="$f(x)$",
    ylim=(0, None),
)
f.tight_layout()

The moment generating function is a function of a new variable, theta. It is the expected value of $e^{\theta x}$, where $x$ is our random variable. Notice that for any value of $\theta$, the expectation of $e^{\theta x}$ is a number.

Discretely, the MGF at each value of theta is the expectation of `exp(theta[i] * x)`.

The expectation is just the sum of `exp(theta[i] * x)`, with each term weighted by the probability of observing x.
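To make the weighted sum concrete, here is a minimal sketch for a discrete random variable. The fair die and the helper name `mgf_discrete` are illustrative assumptions, not part of the original tutorial.

```python
import numpy as np

# Hypothetical example: a fair six-sided die (chosen only for illustration)
values = np.arange(1, 7)
probs = np.full(6, 1 / 6)

def mgf_discrete(theta, values, probs):
    """Expectation of exp(theta * x) as a probability-weighted sum."""
    theta = np.atleast_1d(theta)
    # Rows index theta, columns index the possible values of x
    return np.exp(np.outer(theta, values)) @ probs

m0 = mgf_discrete(0.0, values, probs)[0]      # the weights sum to 1, so M(0) = 1
m_half = mgf_discrete(0.5, values, probs)[0]  # M(0.5) for the die
```

Because every outcome is equally likely here, `m_half` is just the plain average of `exp(0.5 * values)`.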

In [None]:
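def mgf(x, theta, pdf):
    """Numerically evaluate M(theta), the expectation of exp(theta * x)."""
    # One row per theta value, one column per x value
    H = np.exp(np.outer(theta, x))
    # Integrate over x; `trapezoid` replaces `trapz`, which newer SciPy releases removed
    return integrate.trapezoid(H * pdf, x)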

In [None]:
dtheta = .01
theta = np.arange(-2.5, 1 + dtheta, dtheta)
y = mgf(x, theta, pdf)

In [None]:
f, ax = plt.subplots()
ax.plot(theta, y)
ax.axhline(1, color=".6", ls="--")
ax.set(
    xlim=(theta.min(), theta.max()),
    xlabel=r"$\theta$",
    ylabel=r"$M(\theta)$",
)
f.tight_layout()

The goal of the next section is to get a feel for why the MGF looks the way it does. Let's just notice a few things about this function.

First, $M(0) = 1$:

In [None]:
mgf(x, 0, pdf)

It should also be obvious that all values must be greater than 0. Remember, it's just a weighted average of `exp(...)`, which is always positive. Notice that on one side of 0, the function keeps getting bigger. On the other side of 0, it dips below 1 and then rises again, crossing 1 at another point. That crossing point turns out to be very important for psychophysics. We'll spend time on it later. For the moment (no pun intended) let's develop an intuition for these properties.

The moment generating function has its name because if you take its derivative at theta=0, you get the 1st moment of x. In other words the mean. If you take the second derivative, again evaluated at theta=0, you obtain the 2nd moment: the average of the square of x. And so forth for higher derivatives and moments. 

This is easy to see in the math. Look at section 1.2 of Wald_identity.nb.  Here, let's convince ourselves of this using numerical approximations.
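For reference, here is a sketch of that argument. It assumes we may differentiate under the integral sign, which holds for the well-behaved densities used in this tutorial:

$$M(\theta) = E\left[e^{\theta x}\right] = \int e^{\theta x} f(x)\, dx$$

$$M'(\theta) = \int x\, e^{\theta x} f(x)\, dx \quad\Rightarrow\quad M'(0) = \int x f(x)\, dx = E[x]$$

$$M''(\theta) = \int x^2 e^{\theta x} f(x)\, dx \quad\Rightarrow\quad M''(0) = E\left[x^2\right]$$

Each further derivative brings down another factor of $x$, so the $n$th derivative at $\theta = 0$ is $E[x^n]$.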

The first moment is the expectation of x (i.e., the mean). We know what to expect, because we set the mean `u`:

In [None]:
# Get it from the random variable object
assert np.isclose(u, stats.norm(u, s).moment(1))
assert np.isclose(u, stats.norm(u, s).stats("m"))

# Calculate the expectation
assert np.isclose(u, integrate.trapezoid(x * pdf, x))

# Differentiate the moment generating function
d = np.diff(y) / dtheta

# The numerical differentiation is one-sided; average the two differences
# on either side of theta = 0 to estimate the derivative there
idx = np.argmin(np.abs(theta))
d_0 = d[idx - [0, 1]].mean()
assert np.isclose(u, d_0, rtol=1e-4)

This also provides some intuition for the MGF's appearance. At `theta = 0`, we know that the tangent to the curve should have `slope = u`:

In [None]:
L = np.abs(theta) < .35
tangent = d_0 * theta + 1
ax.plot(theta[L], tangent[L], color=".15")
f

The second moment is the expectation of $x^2$. We also know what this should be, because we set `u` and `s`, above.

In [None]:
m2 = s ** 2 + u ** 2

# Get from the scipy object
assert np.isclose(m2, stats.norm(u, s).moment(2))

# Calculate it directly
assert np.isclose(m2, integrate.trapezoid(x ** 2 * pdf, x))

# Or take the second derivative of the MGF at theta = 0
d2 = np.diff(y, 2) / (dtheta ** 2)
d2_0 = d2[idx - 1]
assert np.isclose(d2_0, m2, rtol=1e-4)

Looking at the curve, we're not surprised that it is convex up at `theta = 0`. The second moment has to be positive. If the variance were larger, then the degree of convexity would increase, making for a tighter U-shaped curve.

Let's pay attention to the point on the left portion of the curve where it crosses the dashed line. The value of theta where this occurs is $\theta_1$. This value is going to turn out to be very important to us.
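If you want a number rather than a reading off the plot, the crossing can be located with a root finder. This is a minimal sketch, not part of the original tutorial: it rebuilds the same normal PDF, uses `scipy.optimize.brentq`, and assumes the bracket `[-2.5, -0.1]` (read off the plot above) contains the nonzero crossing. The helper name `mgf_minus_one` is hypothetical.

```python
import numpy as np
from scipy import stats, optimize

# Assumed setup, mirroring the values used earlier in this tutorial
u, s = 1, 1.2
dx = 0.01
x = np.arange(-10, 10 + dx, dx)
pdf = stats.norm(u, s).pdf(x)

def mgf_minus_one(theta):
    """M(theta) - 1, with the expectation computed by the trapezoid rule."""
    g = np.exp(theta * x) * pdf
    return np.sum((g[1:] + g[:-1]) / 2 * np.diff(x)) - 1

# M dips below 1 just left of 0 and is above 1 again at theta = -2.5,
# so the nonzero root is bracketed by these two points
theta_1 = optimize.brentq(mgf_minus_one, -2.5, -0.1)
```

The bracket deliberately excludes `theta = 0`, which is always a root because $M(0) = 1$.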

Let's define a function that plots the PDF and MGF for arbitrary distributions.

In [None]:
def plot_pdf_mgf(x, theta, *distributions):

    f, (ax_pdf, ax_mgf) = plt.subplots(1, 2, figsize=(9, 5))

    for d in distributions:
        ax_pdf.plot(x, d.pdf(x))
        ax_mgf.plot(theta, mgf(x, theta, d.pdf(x)))
    ax_mgf.axhline(1, color=".6", dashes=(3, 1.5), lw=1)
    
    ax_pdf.set(
        xlabel="$x$",
        ylabel="$f(x)$",
        xlim=(x.min(), x.max()),
        ylim=(0, None),
    )

    ax_mgf.set(
        xlabel=r"$\theta$",
        ylabel=r"$M(\theta)$",
        xlim=(theta.min(), theta.max()),
        ylim=(0, 5),
    )

    f.tight_layout()

First, let's change the standard deviation and hold the mean constant.

In [None]:
u, s = 1, 1.2
d1 = stats.norm(u, s)
d2 = stats.norm(u, 1.5 * s)
plot_pdf_mgf(x, theta, d1, d2)

This is an important intuition to hang on to. If the convexity were greater, the second point where the curve crosses the horizontal line would move closer to $\theta=0$. In other words $\theta_1$ would be smaller in absolute magnitude.

If the variance were smaller, the convexity would be lower and $\theta_1$ would move off to the left, further from $0$.

In [None]:
u, s = 1, 1.2
d1 = stats.norm(u, s)
d2 = stats.norm(u, .8 * s)
plot_pdf_mgf(x, theta, d1, d2)

If the mean were larger, but the convexity did not change, then the slope at 0 would increase, and that would push $\theta_1$ away from 0.

In [None]:
u, s = 1, 1.2
d1 = stats.norm(u, s)
d2 = stats.norm(1.5 * u, s)
plot_pdf_mgf(x, theta, d1, d2)

If the mean were smaller, but the convexity did not change, then the slope at 0 would decrease, and that would pull $\theta_1$ closer to the origin.

In [None]:
u, s = 1, 1.2
d1 = stats.norm(u, s)
d2 = stats.norm(.8 * u, s)
plot_pdf_mgf(x, theta, d1, d2)

Now check this out. We can pit these two tendencies against each other. It turns out that their effects cancel if we scale the mean and variance by the same factor.

In [None]:
u, s = 1, 1.2
d1 = stats.norm(u, s)
d2 = stats.norm(1.5 * u, np.sqrt(1.5) * s)
plot_pdf_mgf(x, theta, d1, d2)

That's an important observation. For the Gaussian distribution, this special crossing point, $\theta_1$, is a function of the ratio of mean to variance. If that ratio is kept the same, $\theta_1$ does not change. That's not to say that the MGF doesn't change. You can see from the figure above that it does. But that value, $\theta_1$, is going to turn out to be pivotal to performance.
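For the Gaussian case the crossing point can be worked out in closed form. This sketch assumes the standard Gaussian MGF and writes $\mu$ and $\sigma$ for the mean and standard deviation (`u` and `s` in the code):

$$M(\theta) = \exp\left(\mu\theta + \tfrac{1}{2}\sigma^2\theta^2\right)$$

Setting $M(\theta) = 1$ requires the exponent to vanish:

$$\mu\theta + \tfrac{1}{2}\sigma^2\theta^2 = \theta\left(\mu + \tfrac{1}{2}\sigma^2\theta\right) = 0 \quad\Rightarrow\quad \theta = 0 \ \text{ or } \ \theta_1 = -\frac{2\mu}{\sigma^2}$$

Multiplying $\mu$ and $\sigma^2$ by the same factor leaves $\theta_1$ untouched. With `u, s = 1, 1.2`, this gives $\theta_1 = -2/1.44 \approx -1.39$.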

Already, you should be scratching your head skeptically. Shouldn't performance have something to do with signal-to-noise ratio, hence mean and standard deviation? Yet $\theta_1$ depends on mean and *variance*. Store the thought. We have yet to connect $\theta_1$ to the psychometric function.

Let's expand our intuitions by looking at MGFs associated with mean = 0 or mean < 0.  Let's start with the former.

When the mean is zero, we know that the slope of the MGF at theta=0 is zero. Moreover, the function is convex up. So there's no second crossing of the horizontal line: the only solution is $\theta = 0$, so $\theta_1$ is effectively 0.

In [None]:
u, s = 0, 1
d1 = stats.norm(u, s)
d2 = stats.norm(u, 1.5 * s)
plot_pdf_mgf(x, theta, d1, d2)

When the mean is less than 0, everything simply flips around the y-axis. I think this only seems obvious, but if you think about it (or do the math) you'll see that if pdf2(x) = pdf1(-x), then the weighted averages of $e^{\theta x}$ with weights given by pdf2 have to correspond to the opposite signed theta. In any case we can see this graphically.
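Here is the math in one line, as a sketch. Write $M_1$ and $M_2$ for the MGFs of `pdf1` and `pdf2`, and substitute $y = -x$ (the symbols $M_1$, $M_2$, $f_1$ and $y$ are labels introduced here for illustration):

$$M_2(\theta) = \int e^{\theta x} f_1(-x)\, dx = \int e^{-\theta y} f_1(y)\, dy = M_1(-\theta)$$

So the MGF of the flipped distribution is the original MGF reflected about $\theta = 0$.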

In [None]:
theta = np.arange(-2.5, 2.5 + dtheta, dtheta)
u, s = 1, 1.2
d1 = stats.norm(u, s)
d2 = stats.norm(-u, s)
plot_pdf_mgf(x, theta, d1, d2)

The ability to think about the negative version of a random variable will be important in a moment when we consider sums of RVs.

## The MGF of a sum of RVs

I may fill this in with more intuition later. But here's the bottom line. Suppose $x_1$ has distribution $f_1(x)$ and moment generating function $M_1(\theta)$, and suppose $x_2$ has distribution $f_2(x)$ and moment generating function $M_2(\theta)$. Consider the random variable $Z = x_1 + x_2$, where $x_1$ and $x_2$ are independent. It has a pdf that is the convolution of $f_1(x)$ and $f_2(x)$. It has a moment generating function that is the *product* of $M_1(\theta)$ and $M_2(\theta)$.

It may seem surprising that the pdf $f(z)$ is a convolution, but think about it. Suppose we know the value, $x_1$. Then the conditional probability of observing any $z$ as the sum $x_1 + x_2$ is the probability of choosing $x_2 = z - x_1$. That's $f_2(z-x_1)$. To get the probability of getting $z$ from any sum, it's just a matter of integrating this conditional probability, weighted by $f_1(x_1)$, across all possible values of $x_1$:

$$f(z) = \int f_1(x_1)f_2(z - x_1)\, dx_1$$
   

In other words, `f_z = conv(f_1, f_2)`. (I'm not writing real code here. We would have to be careful about axes.)
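For the curious, here is one way the real code could look. It is a minimal sketch under stated assumptions: the two normal densities are made up for illustration, the grid is symmetric about 0 with an odd number of samples so that `np.convolve(..., mode="same")` lands back on the same axis, and the helper `mgf_on_grid` is hypothetical.

```python
import numpy as np
from scipy import stats

dx = 0.01
x = np.arange(-10, 10 + dx / 2, dx)   # symmetric grid, odd number of samples

# Two hypothetical normal densities, chosen only for illustration
f_1 = stats.norm(1, 1.2).pdf(x)
f_2 = stats.norm(-0.5, 0.8).pdf(x)

# The discrete convolution approximates the integral once scaled by dx.
# With this grid, mode="same" returns f_z sampled on the same x values.
f_z = np.convolve(f_1, f_2, mode="same") * dx

# Check: a sum of independent normals is normal; means and variances add
expected = stats.norm(1 - 0.5, np.sqrt(1.2 ** 2 + 0.8 ** 2)).pdf(x)
max_err = np.max(np.abs(f_z - expected))

def mgf_on_grid(f, theta):
    """Riemann-sum estimate of the MGF of density f at a single theta."""
    return np.sum(np.exp(theta * x) * f) * dx

# The MGF of the convolved density matches the product of the two MGFs
lhs = mgf_on_grid(f_z, 0.3)
rhs = mgf_on_grid(f_1, 0.3) * mgf_on_grid(f_2, 0.3)
```

If the grid were not centered on 0, the output of the convolution would be shifted relative to `x`, which is the axis bookkeeping mentioned above.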

I'm not going to go through the math, but instead I'm appealing to an intuition that I hope we share. If we were dealing with functions of time or space, we would know that the Fourier transform of `f_z` would be the product of the Fourier transforms of `f_1` and `f_2`. The moment generating function is a lot like a Fourier transform.

I may flesh this out one day.
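In the meantime, the product rule takes one line if we assume $x_1$ and $x_2$ are independent, which is what lets the expectation factor. Here $Z = x_1 + x_2$ and $M_Z$ denotes its MGF:

$$M_Z(\theta) = E\left[e^{\theta (x_1 + x_2)}\right] = E\left[e^{\theta x_1} e^{\theta x_2}\right] = E\left[e^{\theta x_1}\right] E\left[e^{\theta x_2}\right] = M_1(\theta)\, M_2(\theta)$$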

There are two points relevant to our topic. First, if we were to make a new RV from the difference of two RVs, we need simply multiply $M_1(\theta)$ by $M_2(-\theta)$. That's because we're adding $x_1$ and $-x_2$. Second, when we look at Wald's identity, we should not be surprised that the moment generating function for a sum of RVs is equal to the product of MGFs for the increments.

## Part 2: MGF of differences of random variables

The MGF of a variety of RVs formed by taking the difference of two RVs has a nonzero $\theta_1$. Recall that if $Z$ is the sum of two independent RVs, $X + Y$, with MGFs $M_X(\theta)$ and $M_Y(\theta)$, the MGF associated with $Z$ is $M_Z(\theta) = M_X(\theta)M_Y(\theta)$. Subtraction is like adding one of the RVs with its sign switched, so the MGF of $X - Y$ is $M_X(\theta)M_Y(-\theta)$.

Let's look at the difference between two RVs described by two Weibull distributions.

In [None]:
x = np.arange(0, 20 + dx, dx)
theta = np.arange(-1, 1 + dtheta, dtheta)
d1 = stats.weibull_min(1.2, 0, 3)
d2 = stats.weibull_min(2.4, 0, 3)
plot_pdf_mgf(x, theta, d1, d2)
f = plt.gcf()

Now add in the MGF for the difference of the second Weibull minus the first. Notice the  $-\theta$ on the first MGF:

In [None]:
y = mgf(x, -theta, d1.pdf(x)) * mgf(x, theta, d2.pdf(x))
f.axes[1].plot(theta, y)
f

Now use the MGF to get the mean difference.

In [None]:
d = np.diff(y) / np.diff(theta)
idx = np.argmin(np.abs(theta))
d_0 = d[idx - [0, 1]].mean()

L = np.abs(theta) < .25
tangent = d_0 * theta + 1
f.axes[1].plot(theta[L], tangent[L], color=".2")
f

Here is the second moment, from which we get the variance:

In [None]:
dd = np.diff(y, 2) / dtheta ** 2
dd_0 = dd[idx - 1]
var_est = dd_0 - d_0 ** 2

Compare our estimates of the mean and variance with the exact values:

In [None]:
assert np.isclose(d_0, d2.mean() - d1.mean(), rtol=1e-2)
assert np.isclose(var_est, d2.var() + d1.var(), rtol=1e-2)
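Finally, we can look for the special root of this difference MGF. The following is a minimal sketch rather than part of the original tutorial: it rebuilds the two Weibull densities, uses `scipy.optimize.brentq`, and assumes the bracket `[0.01, 1]` contains the nonzero crossing. The helper names `trapezoid` and `mgf_diff_minus_one` are hypothetical. The answer is only as accurate as the truncated grid allows.

```python
import numpy as np
from scipy import stats, optimize

dx = 0.01
x = np.arange(0, 20 + dx, dx)
d1 = stats.weibull_min(1.2, 0, 3)
d2 = stats.weibull_min(2.4, 0, 3)
pdf1, pdf2 = d1.pdf(x), d2.pdf(x)

def trapezoid(g):
    """Trapezoid-rule integral of the samples g over the grid x."""
    return np.sum((g[1:] + g[:-1]) / 2 * np.diff(x))

def mgf_diff_minus_one(theta):
    """M_Z(theta) - 1 for Z = (second Weibull) - (first Weibull)."""
    m1_neg = trapezoid(np.exp(-theta * x) * pdf1)   # MGF of d1 at -theta
    m2_pos = trapezoid(np.exp(theta * x) * pdf2)    # MGF of d2 at +theta
    return m1_neg * m2_pos - 1

# The mean difference is slightly negative, so the curve dips below 1
# to the right of 0 and the nonzero root lies at theta > 0
theta_1 = optimize.brentq(mgf_diff_minus_one, 0.01, 1)
```

Because the mean of the difference is close to zero relative to its variance, the root sits close to the origin.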