*Credit*: some material here has been adapted from [Machine Learning: A Probabilistic Perspective](https://www.cs.ubc.ca/~murphyk/MLbook/) by Kevin P. Murphy (Chapter 2).


In [None]:
import numpy as np
import matplotlib.pyplot as plt

# Discrete and Continuous Probabilities

## What is probability?
 
* At least two different interpretations:
    * **Frequentist**: probabilities are long-run frequencies of events
    * **Bayesian**: probabilities are used to quantify our **uncertainty**

One advantage of the Bayesian interpretation is that it can be used to model events that do not have long-term frequencies. 

## Discrete random variables

$P(A)$ denotes the probability that the event $A$ is true.

* $0 \leq P(A) \leq 1$

We write $P(\bar{A})$ to denote the probability of the event "not $A$".

* $P(\bar{A}) = 1 - P(A)$

We can extend the notion of binary events by defining a **discrete random variable** $X$ which can take on any value from a finite or countably infinite set $\mathcal{X}$. We denote the probability of the event that $X = x$ by $P(X = x)$, or just $p(x)$ for short. The shorthand is convenient because it lets us treat probability as a function that takes a state $x$ and returns a real number; this function $p(\cdot)$ is called a **probability mass function** or **pmf**. It satisfies:

* $0 \leq p(x) \leq 1$
* $\sum_{x \in \mathcal{X}} p(x) = 1$

Let's look at some discrete distributions:

In [None]:
fig, ax = plt.subplots(1, 2)

ax[0].bar([1, 2, 3, 4],[0.25, 0.25, 0.25, 0.25], align='center')
ax[0].set_ylim([0, 1])
_ = ax[0].set_xticks([1, 2, 3, 4])
ax[0].set_title('Uniform distribution')

ax[1].bar([1, 2, 3, 4],[0, 1.0, 0, 0], align='center')
ax[1].set_ylim([0, 1])
_ = ax[1].set_xticks([1, 2, 3, 4])
ax[1].set_title('Degenerate distribution')
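
The two pmf conditions stated earlier (each $p(x)$ lies in $[0, 1]$ and the values sum to 1) can be checked numerically. The sketch below is only an illustration: `is_valid_pmf` and the example vector `not_a_pmf` are hypothetical names introduced here, not part of the original notebook.

```python
import numpy as np

# Hypothetical helper (introduced for illustration): checks both pmf conditions.
def is_valid_pmf(p, tol=1e-12):
    p = np.asarray(p, dtype=float)
    in_range = np.all((p >= 0) & (p <= 1))      # 0 <= p(x) <= 1 for every state
    sums_to_one = abs(p.sum() - 1.0) <= tol     # sum over all states equals 1
    return bool(in_range and sums_to_one)

uniform_pmf = [0.25, 0.25, 0.25, 0.25]   # same values as the left-hand bar plot
degenerate_pmf = [0.0, 1.0, 0.0, 0.0]    # same values as the right-hand bar plot
not_a_pmf = [0.5, 0.6, 0.1, 0.0]         # made-up vector whose entries sum to 1.2

print(is_valid_pmf(uniform_pmf))
print(is_valid_pmf(degenerate_pmf))
print(is_valid_pmf(not_a_pmf))
```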

## Fundamental rules

### Probability of a union of two events

Given two events, $A$ and $B$, we define the probability of $A$ or $B$ as

$$
\begin{align}
P(A \lor B) &= P(A) + P(B) - P(A \land B) \\
&= P(A) + P(B) & \text{if $A$ and $B$ are mutually exclusive}
\end{align}
$$
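
As a small sanity check of the union rule, the sketch below enumerates the outcomes of a fair six-sided die. The die, the events `A`, `B`, `C`, `D`, and the helper `prob` are assumptions chosen for illustration; exact fractions are used to avoid rounding.

```python
from fractions import Fraction

# Assumed sample space: one roll of a fair six-sided die.
outcomes = set(range(1, 7))
A = {x for x in outcomes if x % 2 == 0}   # "roll is even"
B = {x for x in outcomes if x > 3}        # "roll is greater than 3"

def prob(event):
    # With equally likely outcomes, probability is a ratio of counts.
    return Fraction(len(event), len(outcomes))

p_union = prob(A | B)                          # direct count of A or B
p_rule = prob(A) + prob(B) - prob(A & B)       # union rule
print(p_union, p_rule)

# Mutually exclusive events: the intersection is empty, so the last term vanishes.
C, D = {1}, {2}
print(prob(C | D), prob(C) + prob(D))
```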

### Joint probabilities

We define the probability of the joint event $A$ and $B$ as 

$$
P(A,B) = P(A \land B) = P(A|B)P(B)
$$

Given a **joint distribution** $P(A,B)$ on two events, we define the **marginal distribution** by summing over all possible states $b$ of $B$:

$$
P(A) = \sum_b P(A,B=b) = \sum_b P(A|B=b)P(B=b)
$$
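
Marginalization amounts to summing a joint table along one axis. The sketch below uses a made-up 2x2 joint table (the numbers are assumptions chosen only so that they sum to 1) and checks that summing the joint directly agrees with summing conditional times marginal.

```python
import numpy as np

# Assumed joint table: rows index the states of A, columns index the states of B.
joint = np.array([[0.10, 0.20],
                  [0.30, 0.40]])

p_a = joint.sum(axis=1)   # marginal of A: sum over the states of B
p_b = joint.sum(axis=0)   # marginal of B: sum over the states of A

# Conditional table P(A | B=b): each column of the joint divided by P(B=b).
p_a_given_b = joint / p_b

# Second form of the sum: sum over b of P(A | B=b) P(B=b).
p_a_via_product = (p_a_given_b * p_b).sum(axis=1)

print(p_a)
print(p_a_via_product)
```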

### Conditional probability

We define the **conditional probability** of event $A$, given that event $B$ is true, as

$$
\begin{align}
P(A|B) &= \frac{P(A,B)}{P(B)} & \text{if $P(B) > 0$}
\end{align}
$$
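
The definition can be tried out on a fair six-sided die. This is a minimal sketch under assumed events ("even" and "greater than 3"); the helper names `prob` and `conditional` are hypothetical, and the $P(B) > 0$ requirement is enforced explicitly.

```python
from fractions import Fraction

# Assumed sample space: one roll of a fair six-sided die.
outcomes = set(range(1, 7))
A = {x for x in outcomes if x % 2 == 0}   # "roll is even"
B = {x for x in outcomes if x > 3}        # "roll is greater than 3"

def prob(event):
    return Fraction(len(event), len(outcomes))

def conditional(event, given):
    # P(event | given) = P(event, given) / P(given), defined only if P(given) > 0.
    p_given = prob(given)
    if p_given == 0:
        raise ValueError("P(B) must be positive to condition on B")
    return prob(event & given) / p_given

p_a_given_b = conditional(A, B)
print(p_a_given_b)
```

Here knowing that the roll exceeds 3 raises the probability of an even roll above its unconditional value of 1/2, because two of the three remaining outcomes are even.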

### Joint and conditional probability for discrete random variables

We can extend joint and conditional probability of binary events to discrete random variables, similarly to above.

For two random variables $X$ and $Y$, the probability that $X=x$ and $Y=y$ is written as $P(X=x, Y=y)$ or $p(x,y)$ for short.

If we consider only the instances where $X=x$, then the fraction of those instances in which $Y=y$ is the conditional probability $P(Y=y|X=x)$, written $p(y|x)$ for short.
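
The "fraction of instances" reading of $p(y|x)$ can be made concrete by counting. The sketch below uses a small made-up list of $(x, y)$ observations; the data and the helper `cond_prob` are assumptions for illustration only.

```python
# Made-up observations of (x, y) pairs, used only to illustrate the counting.
samples = [(0, 0), (0, 1), (0, 1), (1, 0),
           (1, 0), (1, 0), (1, 1), (0, 1)]

def cond_prob(y, x):
    # Keep only the instances where X = x, then take the fraction with Y = y.
    ys_given_x = [sy for sx, sy in samples if sx == x]
    if not ys_given_x:
        raise ValueError("no instances with X = x, so p(y|x) is undefined")
    return ys_given_x.count(y) / len(ys_given_x)

print(cond_prob(1, 0))   # fraction of Y = 1 among instances with X = 0
print(cond_prob(0, 1))   # fraction of Y = 0 among instances with X = 1
```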

## Continuous random variables

Suppose $X$ is some uncertain continuous quantity. The probability that $X$ lies in any interval $a < X \leq b$ can be computed as follows. Define the events $A = (X \leq a)$, $B = (X \leq b)$ and $W = (a < X \leq b)$. We have that $B = A \vee W$, and since $A$ and $W$ are mutually exclusive, the sum rule gives

$$P(B) = P(A) + P(W)$$

and hence

$$P(W) = P(B) - P(A)$$

Define the function $F_X(x) \triangleq P(X \leq x)$. This is called the **cumulative distribution function** or **cdf** of $X$; it is a monotonically non-decreasing function.

In [None]:
# CDF of Gaussian N(0,1)
import scipy.stats as stats
f = lambda x : stats.norm.cdf(x, 0, 1)
x = np.arange(-3, 3, 0.1)
y = f(x)

plt.plot(x, y, 'b')
plt.title('CDF')

Using the above notation, we have
$$P(a < X \leq b) = F_X(b) - F_X(a)$$
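
For a standard Gaussian, this difference of cdf values can be evaluated with `scipy.stats.norm.cdf`. The endpoints $a = -1.96$ and $b = 1.96$ are an assumed example, chosen because they bracket roughly 95% of the mass.

```python
import scipy.stats as stats

# Assumed interval for illustration: (a, b] = (-1.96, 1.96] under N(0, 1).
a, b = -1.96, 1.96

# P(a < X <= b) = F_X(b) - F_X(a)
p_interval = stats.norm.cdf(b, 0, 1) - stats.norm.cdf(a, 0, 1)
print(p_interval)   # approximately 0.95
```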

Now define $f(x) = \frac{d}{dx} F_X(x)$ (we assume this derivative exists); this is called a **probability density function** or **pdf**. Given a pdf, we can compute the probability of a continuous variable being in a finite interval as follows:

$$P(a < X \leq b) = \int_a^b f(x) dx$$
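
The integral form and the cdf form should agree. The sketch below compares them for a standard Gaussian, using `scipy.integrate.quad` for numerical integration; the interval endpoints are arbitrary assumed values.

```python
import scipy.stats as stats
from scipy.integrate import quad

# Assumed interval for illustration.
a, b = -1.0, 2.0

# Integrate the N(0, 1) density over [a, b]; quad returns (value, error estimate).
integral, abs_err = quad(stats.norm.pdf, a, b)

# The same probability from the cdf.
cdf_diff = stats.norm.cdf(b) - stats.norm.cdf(a)

print(integral, cdf_diff)
```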

In [None]:
# PDF of Gaussian N(0,1)
# The two shaded tails together hold about 0.05 of the mass, so the central
# interval mu +/- 1.96 sigma (roughly mu +/- 2 sigma) holds about 0.95
f = lambda x : stats.norm.pdf(x, 0, 1)
x = np.arange(-4, 4, 0.1)
y = f(x)

plt.plot(x, y, 'b')
l_x = np.arange(-4, -1.96, 0.01)
plt.fill_between(l_x, f(l_x))
u_x = np.arange(1.96, 4, 0.01)
plt.fill_between(u_x, f(u_x))

plt.title('PDF')

We require that the density satisfies $f(x) \geq 0$ for all $x$, but $f(x) > 1$ is possible at some values of $x$, so long as the density still integrates to 1.

In [None]:
# Example of f(x) > 1: uniform distribution on (0, 0.5), with density 2 there
f = lambda x: stats.uniform.pdf(x, 0, 0.5)
x = np.arange(-0.5, 1, 0.01)
y = f(x)

plt.plot(x, y, 'b')
plt.title('Uniform PDF')
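
The claim that a density can exceed 1 while still integrating to 1 can be checked numerically for this uniform example. This is a rough sketch; it relies on SciPy's `loc`/`scale` parameterization, under which `stats.uniform` is supported on `[loc, loc + scale]`.

```python
import scipy.stats as stats
from scipy.integrate import quad

# Uniform distribution on [0, 0.5]: loc=0 is the left edge, scale=0.5 is the width.
pdf = lambda x: stats.uniform.pdf(x, loc=0, scale=0.5)

density_inside = pdf(0.25)          # density at a point inside the interval
total_mass, _ = quad(pdf, 0, 0.5)   # integrate the density over its support

print(density_inside)   # greater than 1
print(total_mass)       # approximately 1
```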