## 2. Probability distributions

### 2.1 Hypergeometric distribution

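Density function. The function `hypergeom.pmf` requires four arguments: `k = k`, `M = N`, `n = M`, and `N = n`. SciPy's argument names differ from the notation used in the text: SciPy's `M` is the population size, its `n` is the number of errors in the population, and its `N` is the sample size. In *Case: Number of students*, we sampled $n = 60$ items from a list of $N = 331$ PhDs. We assume that $M = 17$ have erroneously been listed. The probability $P(k = 0)$ of finding no errors in the sample is calculated as follows.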

In [50]:
from scipy.stats import hypergeom

# Parameters
N = 331  # Total number of items in the population
M = 17   # Number of items in the population that are classified as successes
n = 60   # Number of items drawn from the population
k = 0    # Number of successes in the draw

# Calculate the hypergeometric probability mass function (PMF)
probability = hypergeom.pmf(k = k, M = N, n = M, N = n)

print("Probability:", probability)


Probability: 0.03036497699701723


Similarly, the probability $P(k = 1)$ of finding one error is

In [51]:
hypergeom.pmf(1, 331, 17, 60)

0.12145990798806892

Cumulative probability function. The probability $P(k \leq 1)$ of finding up to one error is

In [52]:
hypergeom.cdf(1, 331, 17, 60)

0.15182488498508614

This is the sum of the two probabilities $P(k = 0)$ and $P(k = 1)$:

In [53]:
hypergeom.pmf(0, 331, 17, 60) + hypergeom.pmf(1, 331, 17, 60)

0.15182488498508614

We can also calculate the right-hand tail probability $P(k > 1)$.

In [54]:
hypergeom.sf(1, 331, 17, 60)

0.8481751150149138

This is the complement of the left-hand tail probability $P(k \leq 1)$, in other words, the two probabilities sum to one.

In [55]:
hypergeom.cdf(1, 331, 17, 60) + hypergeom.sf(1, 331, 17, 60)

1.0
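As a cross-check on the SciPy results above, the hypergeometric probabilities can be computed directly from binomial coefficients, $P(k) = \binom{M}{k}\binom{N-M}{n-k} / \binom{N}{n}$. The following is a minimal sketch using only the standard library; the helper name `hypergeom_pmf` is our own, and its arguments follow the notation of the text ($N$, $M$, $n$) rather than SciPy's naming.

```python
from math import comb


def hypergeom_pmf(k, N, M, n):
    """P(k errors in a sample of n) when M of the N population items are errors.

    Argument names follow the text's notation, not SciPy's.
    """
    return comb(M, k) * comb(N - M, n - k) / comb(N, n)


p0 = hypergeom_pmf(0, N=331, M=17, n=60)  # P(k = 0)
p1 = hypergeom_pmf(1, N=331, M=17, n=60)  # P(k = 1)

print("P(k = 0):  ", p0)
print("P(k = 1):  ", p1)
print("P(k <= 1): ", p0 + p1)
```

The printed values should agree with the `hypergeom.pmf` and `hypergeom.cdf` outputs above up to floating-point rounding.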

### 2.2 Binomial distribution

Density function. The function `binom.pmf` requires three arguments: `k = k`, `n = n`, and `p = M / N`. The probability `p` is the error rate in the population; in *Case: Number of students* this is $p = M / N = 17 / 331$.


In [56]:
from scipy.stats import binom

# Parameters
N = 331  # Total number of items in the population
M = 17  # Number of items in the population that are classified as successes
n = 60   # Number of items drawn from the population
k = 1    # Number of successes in the draw

# Calculate the binomial probability mass function (PMF)
probability = binom.pmf(k = k, n = n, p = M / N)

print("Probability:", probability)

Probability: 0.1373313791683671


Cumulative probability function.

In [57]:
binom.cdf(k = k, n = n, p = M / N)

0.17960790177509958
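The binomial probabilities can also be written out by hand as $P(k) = \binom{n}{k} p^k (1 - p)^{n - k}$. The sketch below uses only the standard library; the helper name `binom_pmf` is our own and is not part of SciPy.

```python
from math import comb


def binom_pmf(k, n, p):
    """Probability of exactly k errors in n independent draws with error rate p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


p = 17 / 331  # Error rate in the population, M / N

b0 = binom_pmf(0, 60, p)  # P(k = 0)
b1 = binom_pmf(1, 60, p)  # P(k = 1)

print("P(k = 1): ", b1)
print("P(k <= 1):", b0 + b1)
```

Both results should match the `binom.pmf` and `binom.cdf` outputs above up to floating-point rounding.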

### 2.3 Poisson distribution

Density function. The function `poisson.pmf` requires two arguments: `k = k` and `mu = n * p`, the expected number of errors in the sample, with `p` defined as for the binomial distribution.


In [58]:
from scipy.stats import poisson

# Parameters
N = 331  # Total number of items in the population
M = 17   # Number of items in the population that are classified as successes
n = 60   # Number of items drawn from the population
k = 1    # Number of successes in the draw

# Calculate the Poisson probability mass function (PMF)
probability = poisson.pmf(k = k, mu = n * M / N)

print("Probability:", probability)

Probability: 0.1414043918733124


Similarly, for $k = 3$

In [59]:
poisson.pmf(3, 60 * 17 / 331)

0.22379789843860848

Cumulative probability function.

In [60]:
poisson.cdf(1, 60 * 17 / 331)

0.1872915033537697
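For comparison, the Poisson probabilities follow from $P(k) = e^{-\mu} \mu^k / k!$ with $\mu = n M / N$. This is a minimal standard-library sketch; the helper name `poisson_pmf` is our own.

```python
from math import exp, factorial


def poisson_pmf(k, mu):
    """Probability of exactly k errors when mu errors are expected."""
    return exp(-mu) * mu**k / factorial(k)


mu = 60 * 17 / 331  # Expected number of errors in the sample, n * M / N

q1 = poisson_pmf(1, mu)                          # P(k = 1)
q3 = poisson_pmf(3, mu)                          # P(k = 3)
cdf1 = poisson_pmf(0, mu) + poisson_pmf(1, mu)   # P(k <= 1)

print("P(k = 1): ", q1)
print("P(k = 3): ", q3)
print("P(k <= 1):", cdf1)
```

The three values should reproduce the `poisson.pmf` and `poisson.cdf` outputs above up to floating-point rounding.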

### 2.4 Normal distribution

Distribution function. The function `norm.cdf` takes three arguments: `x` for the sampling result that we evaluate, `loc` for the population mean, and `scale` for the standard deviation of the mean (the standard error).

In [61]:
import math

from scipy.stats import norm

# Parameters
result  = 1012                    # Sampling result to evaluate
popmean = 1030                    # Population mean
sd      = 115.26 / math.sqrt(200) # Standard deviation of the mean

# Calculate the normal cumulative distribution function (CDF)
probability = norm.cdf(x = result, loc = popmean, scale = sd)

print("Probability:", probability)

Probability: 0.013602685762330115


Note that this call passes the arguments by name. In Python we can choose between passing only the values, in the order the function expects them (positional arguments), and passing values explicitly assigned to the argument names that the function uses (keyword arguments).

There are two advantages to naming the arguments. First, the command is easier for a reviewer to interpret. Second, we can pass the arguments in any order. The following commands are therefore equivalent:

In [62]:
norm.cdf(1012, 1030, 115.26 / math.sqrt(200))

0.013602685762330115

In [63]:
norm.cdf(x = 1012, loc = 1030, scale = 115.26 / math.sqrt(200))

0.013602685762330115

In [64]:
norm.cdf(scale = 115.26 / math.sqrt(200), x = 1012, loc = 1030)

0.013602685762330115
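The same behaviour applies to any Python function, not only to SciPy's. As an illustration, the sketch below defines our own `normal_cdf` helper (a hypothetical name, written with the standard library's `math.erfc`) whose arguments mirror `x`, `loc`, and `scale`, and calls it both positionally and by keyword.

```python
import math


def normal_cdf(x, loc=0.0, scale=1.0):
    """P(X <= x) for a normal distribution with mean loc and standard deviation scale."""
    z = (x - loc) / scale
    return 0.5 * math.erfc(-z / math.sqrt(2))


se = 115.26 / math.sqrt(200)  # Standard deviation of the mean

by_position = normal_cdf(1012, 1030, se)             # Values only, in order
by_keyword = normal_cdf(scale=se, x=1012, loc=1030)  # Named, in any order

print(by_position)
print(by_keyword)
```

Both calls return the same probability, which should agree with the `norm.cdf` output above up to floating-point rounding.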

### 2.5 Student's $t$ distribution

To calculate the probability of 14.71\%, we use the function `t.cdf` with arguments `x` for the boundary value and `df` for the number of degrees of freedom.

In [71]:
import math

from scipy.stats import t

# Parameters
tval = (1004 - 1030) / (73.8 / math.sqrt(10)) # Boundary value
df   = 9                                      # Degrees of freedom

# Calculate the Student's t cumulative distribution function (CDF)
probability = t.cdf(x = tval, df = df)

print("Probability:", probability)

Probability: 0.14705532622052686
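To see where this probability comes from, the $t$ density can be integrated numerically. The sketch below is an illustration only, not how SciPy computes `t.cdf`: it applies Simpson's rule to the density between 0 and $|t|$ and uses the symmetry of the distribution. The helper names `t_pdf` and `t_cdf` and the number of integration steps are our own choices.

```python
import math


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)


def t_cdf(x, df, steps=2000):
    """Approximate P(T <= x) with Simpson's rule; steps must be even."""
    a = abs(x)
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3  # P(0 <= T <= |x|)
    # The distribution is symmetric around zero, so P(T <= 0) = 0.5
    return 0.5 + area if x >= 0 else 0.5 - area


tval = (1004 - 1030) / (73.8 / math.sqrt(10))  # Boundary value
approx = t_cdf(tval, df=9)

print("Approximate probability:", approx)
```

The approximation should agree with the `t.cdf` result above to many decimal places.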


### 2.6 $\chi^2$ (chi-squared) distribution

The 95\% upper bound on a $\chi^2$ distributed variable with $10 - 1 = 9$ degrees of freedom is

In [72]:
from scipy.stats import chi2

# Parameters
df = 9  # Degrees of freedom

# Calculate the 95% quantile with the percent point function (PPF, the inverse CDF)
upper_bound = chi2.ppf(q = .95, df = df)

print("Upper bound:", upper_bound)

Upper bound: 16.918977604620448


Similarly, to calculate the upper bound for $\sigma^2$, we start with the lower bound (the 5\% quantile) of a $\chi^2$ distributed variable:


In [67]:
lower_bound = chi2.ppf(q = .05, df = df)

print("Lower bound:", lower_bound)

Lower bound: 3.325112843066815
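The step from this quantile to a bound on $\sigma^2$ uses the fact that, for normally distributed data, $(n - 1) s^2 / \sigma^2$ follows a $\chi^2$ distribution with $n - 1$ degrees of freedom, so the 95\% upper bound is $(n - 1) s^2 / \chi^2_{0.05}$. The sketch below works this out under an assumption: the sample standard deviation $s = 73.8$ and sample size $n = 10$ are taken from section 2.5 and may not be the values intended for this case.

```python
n = 10    # Sample size, so that df = n - 1 = 9
s = 73.8  # ASSUMPTION: sample standard deviation reused from section 2.5

# 5% quantile of the chi-squared distribution with 9 degrees of freedom,
# copied from the chi2.ppf output above
chi2_lower = 3.325112843066815

var_upper = (n - 1) * s**2 / chi2_lower  # 95% upper bound on the variance
sd_upper = var_upper**0.5                # Corresponding bound on the standard deviation

print("Upper bound on variance:", var_upper)
print("Upper bound on standard deviation:", sd_upper)
```

Dividing by the lower $\chi^2$ quantile is what turns it into an upper bound on the variance.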


### 2.7 $F$ distribution

For hypothesis testing, the critical value of the $F$ distribution is calculated with the function `f.ppf`.

In [68]:
from scipy.stats import f

# Parameters
q   = .95 # Cumulative probability (1 - significance level)
dfn = 25  # Degrees of freedom of the numerator
dfd = 23  # Degrees of freedom of the denominator

# Calculate the critical value with the percent point function (PPF, the inverse CDF)
f_crit = f.ppf(q = q, dfn = dfn, dfd = dfd)

print("Critical value:", f_crit)

Critical value: 1.9962706179379208


The probability of exceeding this critical value, in other words the right-hand tail probability, is calculated with the `f.sf` function. It equals the significance level of 5\% up to floating-point rounding.

In [69]:
f.sf(x = f_crit, dfn = dfn, dfd = dfd)

0.049999999999999975
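In a one-sided $F$ test, the critical value is compared with the ratio of two sample variances. The sketch below shows that decision rule; the helper name `f_test_rejects` and the sample variances are hypothetical and serve only as an illustration.

```python
# Critical value copied from the f.ppf output above (q = .95, dfn = 25, dfd = 23)
f_crit = 1.9962706179379208


def f_test_rejects(var_num, var_den, critical_value):
    """One-sided F test: reject equal variances if the ratio exceeds the critical value."""
    return var_num / var_den > critical_value


# Hypothetical sample variances, for illustration only
print(f_test_rejects(var_num=150.0, var_den=60.0, critical_value=f_crit))  # Ratio 2.5
print(f_test_rejects(var_num=90.0, var_den=60.0, critical_value=f_crit))   # Ratio 1.5
```

With these made-up numbers, the first ratio lies above the critical value and the second lies below it, so only the first comparison rejects the hypothesis of equal variances.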