# Chapter 3: Random Variables and their Distributions
 
This Jupyter notebook is the Python equivalent of the R code in section 3.11 R, [Introduction to Probability, 1st Edition](https://www.crcpress.com/Introduction-to-Probability/Blitzstein-Hwang/p/book/9781466575578), Blitzstein & Hwang.

----

## Distributions in SciPy

All of the named distributions that we'll encounter in this book have been implemented in R, but in this Python-based notebook we will use their equivalents in SciPy. In this section we'll explain how to work with the Binomial and Hypergeometric distributions in [`scipy.stats`](https://docs.scipy.org/doc/scipy/reference/tutorial/stats.html). We will also explain in general how to generate r.v.s from any discrete distribution with a finite support. The aforementioned Statistical functions (`scipy.stats`) page is a handy list of the distributions in `scipy.stats`. Typing `scipy.stats.[distribution].__doc__` will display more information on the named distribution.

In general, for many named discrete distributions, three functions `pmf`, `cdf`, and `rvs` will give the PMF, CDF, and random generation, respectively.

In [1]:
import numpy as np

np.random.seed(99)

### Binomial distribution

The Binomial distribution [`binom`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.binom.html#scipy.stats.binom) provides the following three functions: `pmf`, `cdf`, and `rvs`. Unlike R, SciPy also provides an implementation of the Bernoulli distribution, `bernoulli`. As an alternative to `bernoulli`, we can just use the Binomial functions with $n = 1$, so that $k \in \{0,1\}$.

In [2]:
from scipy.stats import binom

#print(binom.__doc__)

* `binom.pmf` is `scipy.stats`' implementation of the Binomial PMF. It takes three inputs: the first is the value of `k` at which to evaluate the PMF, and the second and third are the parameters `n` and `p`. For example, `binom.pmf(3, 5, 0.2)` returns the probability $P(X = 3)$ where $X \sim Bin(5, 0.2)$. In other words,
 
 
\begin{align}
  \texttt{binom.pmf}(3, 5, 0.2) &= \binom{5}{3} (0.2)^{3} (0.8)^{2} \\
  &= 0.0512 
\end{align}

In [3]:
k = 3
n = 5
p = 0.2

binom.pmf(k, n, p)

0.0512
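
As a quick sanity check on the closed-form expression for the Binomial PMF given above, the following sketch evaluates $\binom{5}{3} (0.2)^{3} (0.8)^{2}$ term by term and compares it with `binom.pmf`. It assumes Python 3.8+ for `math.comb`; the names `by_hand` and `from_scipy` are ours and purely illustrative.

```python
from math import comb

import numpy as np
from scipy.stats import binom

k, n, p = 3, 5, 0.2

# Binomial PMF written out term by term: C(n, k) * p^k * (1 - p)^(n - k)
by_hand = comb(n, k) * p**k * (1 - p)**(n - k)

# The same probability from scipy.stats
from_scipy = binom.pmf(k, n, p)

print(by_hand, from_scipy)
print(np.isclose(by_hand, from_scipy))
```

The two values should agree up to floating-point rounding, which is why the comparison uses `np.isclose` rather than `==`.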

* `binom.cdf` is the Binomial CDF. It takes three inputs: the first is the value of `k` at which to evaluate the CDF, and the second and third are the parameters `n` and `p`. For example, `binom.cdf(3, 5, 0.2)` is the probability $P(X \leq 3)$ where $X \sim Bin(5, 0.2)$. So

\begin{align}
  \texttt{binom.cdf}(3, 5, 0.2) &= \sum_{k=0}^{3} \binom{5}{k} (0.2)^k (0.8)^{5-k} \\
  &= 0.9933
\end{align}

In [4]:
k = 3
n = 5
p = 0.2

binom.cdf(k, n, p)

0.99328000000000005
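
The sum in the CDF formula above can also be checked numerically. This is a minimal sketch that adds up `binom.pmf` over $k = 0, 1, 2, 3$ and compares the total with `binom.cdf`; the variable names are illustrative.

```python
import numpy as np
from scipy.stats import binom

k, n, p = 3, 5, 0.2

# P(X <= 3) as the sum of P(X = j) for j = 0, 1, 2, 3
cdf_from_pmf = binom.pmf(np.arange(0, k + 1), n, p).sum()

# The same probability directly from the CDF
cdf_direct = binom.cdf(k, n, p)

print(cdf_from_pmf, cdf_direct)
print(np.isclose(cdf_from_pmf, cdf_direct))
```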

* `binom.rvs` is a function for generating Binomial random variables. For `rvs`, the first and second inputs are the parameters `n` and `p`, and the `size` argument is how many r.v.s we want to generate. Thus the command `binom.rvs(5, 0.2, size=7)` produces realizations of seven i.i.d. $Bin(5, 0.2)$ r.v.s. When we ran this command, we got

In [5]:
n = 5
p = 0.2

binom.rvs(n, p, size=7)

array([1, 1, 2, 0, 2, 1, 0])

Because code cell [1] above fixes the seed with `np.random.seed(99)`, rerunning the notebook from the top should reproduce the values above; change the seed and you'll get different ones.
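
To illustrate how seeding affects `binom.rvs`, here is a small sketch. It assumes the default behavior in which `rvs` draws from NumPy's global random state, and it also shows the `random_state` argument of `rvs`, which keeps the seed local to a single call. The variable names are illustrative.

```python
import numpy as np
from scipy.stats import binom

n, p = 5, 0.2

# Reseeding the global generator with the same value repeats the draws
np.random.seed(99)
first = binom.rvs(n, p, size=7)

np.random.seed(99)
second = binom.rvs(n, p, size=7)

print(np.array_equal(first, second))

# Alternative: pass the seed per call through random_state,
# which leaves the global generator untouched
third = binom.rvs(n, p, size=7, random_state=99)
fourth = binom.rvs(n, p, size=7, random_state=99)

print(np.array_equal(third, fourth))
```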

We can also evaluate PMFs and CDFs at an entire vector of values. For example, recall that `numpy.arange(0,n+1)` is a quick way to list the integers from $0$ to $n$. The command `binom.pmf(numpy.arange(0, 5+1), 5, 0.2)` returns 6 numbers, $P(X = 0)\text{, } P(X = 1)\text{, } \cdots \text{, } P(X = 5)$, where $X \sim Bin(5, 0.2)$.

In [6]:
n = 5
p = 0.2

binom.pmf(np.arange(0, n+1), n, p)

array([  3.27680000e-01,   4.09600000e-01,   2.04800000e-01,
         5.12000000e-02,   6.40000000e-03,   3.20000000e-04])
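
To connect `rvs` with `pmf`, the sketch below compares the exact PMF over the whole support with relative frequencies from simulated draws. The sample size of 100,000 and the seed are arbitrary choices of ours, not values from the book.

```python
import numpy as np
from scipy.stats import binom

n, p = 5, 0.2
support = np.arange(0, n + 1)

# Exact PMF at every point of the support; these probabilities sum to 1
exact = binom.pmf(support, n, p)

# Simulate many draws and tabulate the relative frequency of each value
draws = binom.rvs(n, p, size=100_000, random_state=99)
empirical = np.bincount(draws, minlength=n + 1) / draws.size

# Print value, exact probability, and simulated frequency side by side
for k, e, f in zip(support, exact, empirical):
    print(k, round(e, 4), round(f, 4))
```

With this many draws, the simulated frequencies should land close to the exact probabilities, though they will not match them exactly.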

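Setting $n = 1$ gives the Bernoulli case: `binom.pmf(0, 1, 0.4)` returns $P(X = 0)$ where $X \sim Bern(0.4)$.

In [7]: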
binom.pmf(0, 1, 0.4)

0.59999999999999998
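
Since SciPy also ships `scipy.stats.bernoulli`, the equivalence between $Bern(p)$ and $Bin(1, p)$ can be checked directly. A minimal sketch, with illustrative variable names:

```python
import numpy as np
from scipy.stats import bernoulli, binom

p = 0.4
k = np.array([0, 1])

# Bernoulli PMF obtained from the Binomial with n = 1
via_binom = binom.pmf(k, 1, p)

# Bernoulli PMF from the dedicated distribution
via_bernoulli = bernoulli.pmf(k, p)

print(via_binom, via_bernoulli)
print(np.allclose(via_binom, via_bernoulli))
```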

### Hypergeometric distribution

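The Hypergeometric distribution [`hypergeom`](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.hypergeom.html#scipy.stats.hypergeom) also has three functions: `pmf`, `cdf`, and `rvs`. As one might expect, `hypergeom.pmf` is the Hypergeometric PMF, `hypergeom.cdf` is the Hypergeometric CDF, and `hypergeom.rvs` generates Hypergeometric r.v.s. Since the Hypergeometric distribution has three parameters, `pmf` and `cdf` each take four inputs: the first input is the value at which we wish to evaluate the PMF or CDF, and the remaining inputs are the parameters of the distribution. Note that SciPy parameterizes the distribution differently from the book: `hypergeom` takes the population size `M`, the number of success-type objects `n`, and the number of draws `N`, so $HGeom(w, b, n)$ corresponds to `M` $= w + b$, SciPy's `n` $= w$, and `N` equal to the book's $n$.

Thus `hypergeom.pmf(k, w+b, w, n)` returns the probability $P(X = k)$ where $X \sim HGeom(w, b, n)$, and `hypergeom.cdf(k, w+b, w, n)` returns $P(X \leq k)$. For `rvs`, the inputs are the three parameters, and the `size` argument is the number of r.v.s we want to generate; `hypergeom.rvs(w+b, w, n, size=100)` generates 100 i.i.d. $HGeom(w, b, n)$ r.v.s.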
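
Because SciPy's argument order differs from the book's $HGeom(w, b, n)$ notation, a worked example may help. The sketch below uses made-up values ($w = 7$, $b = 5$, $n = 4$, $k = 2$) and checks `hypergeom.pmf` against the closed form $\binom{w}{k}\binom{b}{n-k} / \binom{w+b}{n}$. The names `n_draws` and `n_success` are ours, chosen to avoid clashing with SciPy's own `n`.

```python
from math import comb

import numpy as np
from scipy.stats import hypergeom

# Illustrative values: w white balls, b black balls, n_draws balls drawn
w, b, n_draws = 7, 5, 4
k = 2

# scipy.stats.hypergeom expects (M, n, N) =
# (population size, number of success-type objects, number of draws)
M, n_success, N = w + b, w, n_draws

# P(X = k) from SciPy and from the closed-form expression
pmf_scipy = hypergeom.pmf(k, M, n_success, N)
pmf_by_hand = comb(w, k) * comb(b, n_draws - k) / comb(w + b, n_draws)

# P(X <= k)
cdf_scipy = hypergeom.cdf(k, M, n_success, N)

# 100 i.i.d. HGeom(w, b, n_draws) draws
samples = hypergeom.rvs(M, n_success, N, size=100, random_state=99)

print(pmf_scipy, pmf_by_hand, cdf_scipy)
print(samples[:10])
```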

### Discrete distributions with finite support

We can generate r.v.s from any discrete distribution with finite support using `numpy.random.choice`. In its simplest form, `np.random.choice(np.arange(1, n+1), k, replace=False)` or `np.random.choice(np.arange(1, n+1), k, replace=True)` samples `k` times from the integers 1 through `n`, either without or with replacement. (Unlike R's `sample`, `numpy.random.choice` samples with replacement by default.) For example, to generate 5 independent $DUnif(1, 2, \dots, 100)$ r.v.s, we can use the command `np.random.choice(np.arange(1, 101), 5, replace=True)`.

It turns out that `numpy.random.choice` is far more versatile. If we want to sample from the values $x_1, \dots, x_n$ with probabilities $p_1, \dots, p_n$, we simply create an array `x` containing all the $x_i$ and an array `p` containing all the $p_i$, then feed them into `numpy.random.choice`. For example, suppose we are interested in generating realizations of i.i.d. r.v.s $X_1, \dots, X_{100}$ whose PMF is

\begin{align}
  P(X_j = 0) &= 0.25 \text{, } \\ 
  P(X_j = 1) &= 0.5\text{, } \\
  P(X_j = 5) &= 0.1\text{, } \\
  P(X_j = 10) &= 0.15\text{, } 
\end{align}

and $P(X_j = x) = 0$ for all other values of $x$. First, we use `numpy.array` to create arrays with the support of the distribution and the PMF probabilities: `x = np.array([0, 1, 5, 10])` and `p = np.array([0.25, 0.5, 0.1, 0.15])`.

Next, we use `numpy.random.choice`. To get 100 samples from the above PMF, call `np.random.choice(x, size=100, replace=True, p=p)`.

The inputs are the array `x` to sample from, the number of samples to generate (100 in this case), whether to sample with replacement, and the probabilities `p` of sampling the values in `x` (if this is omitted, the probabilities are assumed equal).
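
Putting the pieces of this section together, the following sketch draws from both the Discrete Uniform example and the four-point PMF above using `numpy.random.choice`. The seed and the variable names `dunif`, `samples`, `values`, and `counts` are our own choices for illustration.

```python
import numpy as np

np.random.seed(99)

# Uniform case: 5 independent DUnif(1, 2, ..., 100) draws
dunif = np.random.choice(np.arange(1, 101), size=5, replace=True)
print(dunif)

# Support of the distribution and the matching PMF probabilities
x = np.array([0, 1, 5, 10])
p = np.array([0.25, 0.5, 0.1, 0.15])

# 100 i.i.d. draws from the PMF defined by x and p
samples = np.random.choice(x, size=100, replace=True, p=p)
print(samples)

# Relative frequency of each value among the 100 draws
values, counts = np.unique(samples, return_counts=True)
print(dict(zip(values.tolist(), (counts / samples.size).tolist())))
```

With only 100 draws, the relative frequencies will only roughly track `p`; a larger `size` should bring them closer.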

----

&copy; Blitzstein, Joseph K.; Hwang, Jessica. Introduction to Probability (Chapman & Hall/CRC Texts in Statistical Science).