# Probability and Information Theory

Probability theory is used in two main ways in AI. First, the laws of probability tell us how AI systems should reason, so we design our algorithms to compute or approximate various expressions derived using probability theory.

Second, we can use probability and statistics to theoretically analyze the behavior of proposed AI systems.

## Probability
Machine learning must always deal with uncertain quantities, and sometimes with stochastic (non-deterministic) quantities as well.

There are three possible sources of uncertainty:
* Inherent stochasticity in the system being modeled. For instance, most interpretations of quantum mechanics describe the dynamics of subatomic particles as being probabilistic.
* Incomplete observability. Even deterministic systems can appear stochastic when we cannot observe all the variables that drive the behavior of the system.
* Incomplete modeling. When we use a model that must discard some of the information we have observed, the discarded information results in uncertainty in the model's prediction.

While it should be clear that we need a means of representing and reasoning about uncertainty, it is not immediately obvious that probability theory can provide all the tools we want for AI applications. Probability theory was originally developed to analyze the frequencies of (repeatable) events. However, when a doctor diagnoses a patient, the event is not repeatable, so we instead use probability to represent a **degree of belief**.

Probability that relates directly to the rates at which events occur is known as **Frequentist probability**, while probability that relates to qualitative levels of certainty is known as **Bayesian probability**.
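The frequentist reading can be illustrated with a small simulation. The sketch below is only an illustration under stated assumptions: the fair six-sided die, the helper name `relative_frequency`, and the fixed seed are all hypothetical choices, not part of the original notes. It shows the relative frequency of an outcome settling near its probability as the number of repetitions grows.

```python
import random

def relative_frequency(num_trials, seed=0):
    """Fraction of simulated fair-die rolls that land on six."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    hits = sum(1 for _ in range(num_trials) if rng.randint(1, 6) == 6)
    return hits / num_trials

# The relative frequency should drift toward 1/6 as the trial count grows.
for n in (100, 10_000, 100_000):
    print(n, relative_frequency(n))
```

A degree of belief, such as a diagnosis for one particular patient, has no comparable experiment to repeat, which is why it calls for the Bayesian interpretation.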

Probability can be seen as the extension of logic to deal with uncertainty. Logic provides a set of formal rules for determining what propositions are implied to be true or false given the assumption that some other set of propositions is true or false. Probability theory provides a set of formal rules for determining the likelihood of a proposition being true given the likelihood of other propositions.

## Probability distributions

A probability distribution is a description of how likely a random variable or set of random variables is to take on each of its possible states.
A random variable is a variable that can take on different values randomly.

### Discrete Variables and Probability Mass Functions
The probability mass function (PMF) maps from a state of a random variable to the probability of that random variable taking on that state.

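The probability that $\text{x} = x$ is denoted $P(x)$ or, more explicitly, $P(\text{x} = x)$. The notation $\text{x} \sim P(\text{x})$ does not denote a probability; it states that the random variable x follows the distribution $P$.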

Probability mass functions can act on many variables at the same time. Such a distribution over many variables is called a **joint probability distribution**.
$P(\text{x}=x,\text{y}=y)$ denotes the probability that $\text{x}=x$ and $\text{y}=y$ simultaneously.

To be a PMF on a random variable x, a function $P$ must satisfy the following properties:
* the domain of $P$ must be the set of all possible states of x
* $\forall x \in \text{x}, 0 \le P(x) \le 1$. An impossible event has probability 0, and no state can be less probable than that. Likewise, an event that is guaranteed to happen has probability 1, and no state can have a greater chance of occurring.
* $\sum_{x \in \text{x}} P(x) =1$. We refer to this property as being **normalized**. Without this property, we could obtain probabilities greater than one by computing the probability of one of many events occurring.
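The PMF properties above can be checked mechanically. The following is a minimal sketch, assuming a PMF is stored as a Python dict from state to probability; the name `is_valid_pmf`, the tolerance, and both example distributions are hypothetical. The dict's keys play the role of the domain, so the first property holds by construction.

```python
import math

def is_valid_pmf(pmf, tol=1e-9):
    """Check the range and normalization properties of a state -> probability dict."""
    # Every probability must lie in [0, 1].
    in_range = all(0.0 <= p <= 1.0 for p in pmf.values())
    # The probabilities must sum to 1, up to floating-point error.
    normalized = math.isclose(sum(pmf.values()), 1.0, abs_tol=tol)
    return in_range and normalized

fair_die = {face: 1 / 6 for face in range(1, 7)}
not_normalized = {"heads": 0.7, "tails": 0.7}  # sums to 1.4

print(is_valid_pmf(fair_die))
print(is_valid_pmf(not_normalized))
```

The second example shows why normalization matters: the probability of "heads or tails" would come out as 1.4.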

### Continuous Variables and Probability Density Functions

To be a probability density function (PDF), a function $p$ must satisfy the following properties:
* The domain of $p$ must be the set of all possible states of x
* $\forall x \in \text{x}, p(x) \ge 0$. Note that we do not require $p(x) \le 1$
* $\int p(x) dx = 1$

A probability density function $p(x)$ does not give the probability of a specific state directly. Instead, the probability of landing inside an infinitesimal region with volume $\delta x$ is given by $p(x)\delta x$, and the probability that $x$ lies in an interval $[a,b]$ is given by $\int_{[a,b]} p(x)dx$.
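A density value may exceed 1 as long as the density still integrates to 1. The sketch below illustrates this with a uniform distribution on $[0, 0.5]$, whose density is 2 everywhere on that interval. The example distribution, the helper names `uniform_pdf` and `integrate`, and the midpoint rule are illustrative assumptions rather than anything from the original notes.

```python
def uniform_pdf(x, low=0.0, high=0.5):
    """Density of a uniform distribution on [low, high]."""
    return 1.0 / (high - low) if low <= x <= high else 0.0

def integrate(f, a, b, steps=10_000):
    """Midpoint-rule approximation of the integral of f over [a, b]."""
    width = (b - a) / steps
    return sum(f(a + (i + 0.5) * width) for i in range(steps)) * width

print(uniform_pdf(0.25))                  # a density value, not a probability
print(integrate(uniform_pdf, 0.0, 0.5))   # total probability mass
print(integrate(uniform_pdf, 0.1, 0.2))   # probability that x lies in [0.1, 0.2]
```

Only the integrals are probabilities. The density at a single point tells us how much probability mass sits near that point per unit of volume.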

## Marginal Probability

Sometimes we know the probability distribution over a set of variables and we want to know the probability distribution over just a subset of them. The probability distribution over the subset is known as the **marginal probability distribution**.

For example, suppose we have discrete random variables x and y, and we know $P(\text{x,y})$. We can find $P(\text{x})$ with the **sum rule**:

$$
\forall x \in \text{x}, P(\text{x}=x) = \sum_y P(\text{x}=x, \text{y}=y)
$$

The name **marginal probability** comes from the process of computing marginal probabilities on paper. When the values of $P(\text{x,y})$ are written in a grid with different values of $x$ in rows and different values of $y$ in columns, it is natural to sum across a row of the grid, then write $P(x)$ in the margin of the paper just to the right of the row.
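The grid picture translates directly into code. The following sketch assumes a small, made-up joint distribution stored as a list of rows; the numbers and variable names are hypothetical. Summing across a row gives the marginal for $x$, and summing down a column gives the marginal for $y$.

```python
# Hypothetical joint distribution P(x, y): rows index x, columns index y.
joint = [
    [0.10, 0.20, 0.10],  # x = 0
    [0.30, 0.05, 0.25],  # x = 1
]

# Sum rule: P(x) sums across each row, P(y) sums down each column.
p_x = [sum(row) for row in joint]
p_y = [sum(col) for col in zip(*joint)]

print(p_x)  # values that would be written in the right-hand margin
print(p_y)  # values that would be written in the bottom margin
```

Each marginal is itself a normalized distribution, because the entries of the joint grid sum to 1.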

For continuous variables, we need to use integration instead of summation:

$$
p(x) = \int p(x,y)dy
$$
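As a worked illustration, take the hypothetical joint density $p(x,y) = x + y$ on the unit square, which integrates to 1. Integrating out $y$ gives the closed form $p(x) = x + \tfrac{1}{2}$. The sketch below approximates the same integral numerically with a midpoint rule; the density, the function names, and the step count are all assumptions chosen for illustration.

```python
def joint_pdf(x, y):
    """Hypothetical joint density p(x, y) = x + y on the unit square."""
    return x + y if 0.0 <= x <= 1.0 and 0.0 <= y <= 1.0 else 0.0

def marginal_x(x, steps=1_000):
    """Approximate p(x) = integral of p(x, y) dy over [0, 1] with the midpoint rule."""
    width = 1.0 / steps
    return sum(joint_pdf(x, (j + 0.5) * width) for j in range(steps)) * width

# Compare the numerical marginal with the closed form x + 1/2.
for x in (0.0, 0.5, 1.0):
    print(x, marginal_x(x), x + 0.5)
```

The two columns should agree up to floating-point error, since the midpoint rule is exact for integrands that are linear in $y$.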

In [15]:
import numpy as np
a = np.array([1, np.nan])
np.array_equal(a, a)  # NaN never compares equal to itself, so even an array compared with itself gives False

False

In [16]:
import numpy as np
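np.empty([3, 4])  # allocates a 3x4 array without initializing its contents

np.array_equal(np.empty([3, 4]), np.empty([3, 4]))  # uninitialized contents are arbitrary, so the two arrays need not match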

False