# Probability

"Probability theory is a mathematical framework for **representing uncertain statements**. It provides a means of **quantifying uncertainty** as well as axioms for deriving new uncertain statements. In Artificial Intelligence, we use probability theory in two major ways. First, the laws of probability tell us how AI systems should reason, so we design our algorithms to **compute or approximate various expressions** derived using probability theory. Second, we can use probability and statistics to theoretically **analyze the behavior of proposed AI systems**."

## Random Variable

A **random variable** is a variable that can take on different values randomly.  
On its own it only describes the states that are possible; it must be coupled with a probability distribution that specifies how likely each state is.  
It can be **discrete** or **continuous**.

## Probability Distributions

A **probability distribution** describes "how likely a random variable is to take on each of its possible states".  
The way we describe it depends on whether the variables are discrete or continuous.  

### Discrete variables and Probability Mass Functions

A probability distribution over **discrete** variables may be described with a **probability mass function (PMF)**.

A PMF can act on many variables at the same time, in which case it is known as a **joint probability distribution**: $p(\mathbf{x} = x, \mathbf{y} = y)$, which we abbreviate as $p(x, y)$.


To be a PMF on a random variable $x$, a function $p$ must satisfy the following : 
 - The domain of $p$ must be the set of all possible states of $x$
 - $\forall x \in \mathbf{x}, 0 \leq p(x) \leq 1$
 - $\sum_{x \in \mathbf{x}}p(x) = 1$, which means that it is **normalized**.

Consider a single discrete random variable $x$ with $k$ different states.  
We can place a **uniform distribution** on $x$ by setting its PMF to : 

\begin{equation*}
p(\mathbf{x}=x_i) = \frac{1}{k}
\end{equation*}
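
As a minimal sketch of the conditions above, the snippet below builds a uniform PMF and checks that it is valid. The names `uniform_pmf` and `is_valid_pmf` are hypothetical helpers introduced only for this illustration, and the PMF is assumed to be stored as a dictionary mapping each distinct state to its probability.

```python
from fractions import Fraction


def uniform_pmf(states):
    """Assign probability 1/k to each of the k (distinct) states."""
    k = len(states)
    return {s: Fraction(1, k) for s in states}


def is_valid_pmf(pmf):
    """Check that every probability lies in [0, 1] and that they sum to 1."""
    values = list(pmf.values())
    in_range = all(0 <= p <= 1 for p in values)
    normalized = sum(values) == 1
    return in_range and normalized


# A fair six-sided die: k = 6 states, each with probability 1/6.
die = uniform_pmf([1, 2, 3, 4, 5, 6])
print(die[3])             # 1/6
print(is_valid_pmf(die))  # True
```

Exact fractions are used here so that the normalization check is not affected by floating-point rounding.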

### Continuous Variables and Probability Density Functions

When working with **continuous variables**, we can use a **probability density function (PDF)** rather than a PMF.  
To be a PDF, a function $p$ must satisfy the following : 
 - The domain of $p$ must be the set of all possible states of $x$
 - $\forall x \in \mathbf{x}, p(x) \geq 0$; note that we do not require $p(x) \leq 1$
 - $\int p(x) dx = 1$
 
A PDF does not give the probability of a specific state directly.  
The probability that $x$ lies in some set $S$ is given by the integral of $p(x)$ over that set.
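
To illustrate, here is a small sketch using a uniform density on $[0, 0.5]$, chosen as an example because its value is $2$ on that interval (so $p(x) > 1$) while it still integrates to $1$. The functions `density` and `integrate` are hypothetical helpers, and the integral is only approximated with a simple midpoint rule.

```python
def density(x):
    """Uniform density on [0, 0.5]: value 2 inside the interval, 0 outside."""
    return 2.0 if 0.0 <= x <= 0.5 else 0.0


def integrate(f, a, b, n=10_000):
    """Approximate the integral of f over [a, b] with the midpoint rule."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))


# Total mass over the support: should be close to 1.
total = integrate(density, 0.0, 0.5)

# Probability that x lies in the set S = [0.1, 0.3]: should be close to 0.4.
prob_in_s = integrate(density, 0.1, 0.3)

print(total, prob_in_s)
```

The density at any single point (here $2$) is not a probability; only its integral over a set is.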

## Marginal Probability

Sometimes we want to know the probability distribution just over a subset of the variables.
It is known as the **marginal probability distribution**.

For example, suppose we have two discrete random variables $x$ and $y$ and we know $P(x, y)$.  
We can find $P(x)$ with the sum rule.

\begin{equation*}
\forall x \in \mathbf{x}, P(\mathbf{x}=x) = \sum_y P(\mathbf{x}=x, \mathbf{y}=y)
\end{equation*}

For continuous variables, we use integration instead of summation : 

\begin{equation*}
p(x) = \int p(x, y) dy
\end{equation*}
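
For the discrete case, the sum rule can be sketched as follows. The joint table over weather and traffic is made-up data for illustration, and `marginal_x` is a hypothetical helper; the joint distribution is assumed to be stored as a dictionary keyed by `(x, y)` pairs.

```python
from collections import defaultdict
from fractions import Fraction

# Hypothetical joint distribution P(x, y): x is the weather, y is the traffic.
joint = {
    ("sun", "light"): Fraction(4, 10),
    ("sun", "heavy"): Fraction(1, 10),
    ("rain", "light"): Fraction(2, 10),
    ("rain", "heavy"): Fraction(3, 10),
}


def marginal_x(joint):
    """Sum rule: P(x) is the sum of P(x, y) over all values of y."""
    p_x = defaultdict(Fraction)  # missing keys start at Fraction(0)
    for (x, _y), p in joint.items():
        p_x[x] += p
    return dict(p_x)


p_x = marginal_x(joint)
print(p_x["sun"], p_x["rain"])  # 1/2 1/2
```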

## Conditional Probability

Sometimes we are interested in the probability of some event given that another event has occurred. This is called a **conditional probability**.  
It can be computed with : 


\begin{equation*}
P(\mathbf{y}=y | \mathbf{x}=x) = \frac{P(\mathbf{y}=y, \mathbf{x}=x)}{P(\mathbf{x}=x)}
\end{equation*}

The conditional probability is only defined when $P(\mathbf{x}=x) > 0$.
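
The formula above can be sketched on a small joint table. The table is made-up data for illustration, and `conditional_y_given_x` is a hypothetical helper that raises an error in the case where $P(\mathbf{x}=x) = 0$.

```python
from fractions import Fraction

# Hypothetical joint distribution P(x, y): x is the weather, y is the traffic.
joint = {
    ("sun", "light"): Fraction(4, 10),
    ("sun", "heavy"): Fraction(1, 10),
    ("rain", "light"): Fraction(2, 10),
    ("rain", "heavy"): Fraction(3, 10),
}


def conditional_y_given_x(joint, x):
    """Return P(y | x) = P(y, x) / P(x) for every y, given a fixed x."""
    # Marginal P(x), obtained with the sum rule.
    p_x = sum(p for (xi, _y), p in joint.items() if xi == x)
    if p_x == 0:
        raise ValueError("P(x) = 0: the conditional probability is undefined")
    return {y: p / p_x for (xi, y), p in joint.items() if xi == x}


cond = conditional_y_given_x(joint, "rain")
print(cond["light"], cond["heavy"])  # 2/5 3/5
```

Note that the conditional distribution over $y$ is itself normalized: its values sum to $1$.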

**Attention**: conditional probability is not the same as computing what would happen if some action were undertaken. Computing the consequences of an action is called making an **intervention query**, which belongs to the domain of **causal modeling**.

## Independence and Conditional Independence

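Two random variables $x$ and $y$ are **independent** if their joint probability distribution can be expressed as a product of two factors, one involving only $x$ and one involving only $y$ : 

\begin{equation*}
\forall x \in \mathbf{x}, y \in \mathbf{y}, \; p(\mathbf{x}=x, \mathbf{y}=y) = p(\mathbf{x}=x) \, p(\mathbf{y}=y)
\end{equation*}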

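Two random variables $x$ and $y$ are **conditionally independent** given a random variable $z$ if the conditional probability distribution over $x$ and $y$ factorizes into a product of a factor involving only $x$ and a factor involving only $y$, for every value of $z$ : 

\begin{equation*}
\forall x \in \mathbf{x}, y \in \mathbf{y}, z \in \mathbf{z}, \; p(\mathbf{x}=x, \mathbf{y}=y | \mathbf{z}=z) = p(\mathbf{x}=x | \mathbf{z}=z) \, p(\mathbf{y}=y | \mathbf{z}=z)
\end{equation*}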
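
As a rough sketch of the (unconditional) independence definition for discrete variables, the snippet below checks whether a joint table equals the product of its marginals for every pair of states. Both tables are made-up data, and `is_independent` is a hypothetical helper; it only covers plain independence, not the conditional case.

```python
from fractions import Fraction
from itertools import product


def is_independent(joint):
    """Return True if P(x, y) == P(x) * P(y) for every pair of states."""
    xs = {x for x, _ in joint}
    ys = {y for _, y in joint}
    # Marginals obtained with the sum rule; missing pairs count as probability 0.
    p_x = {x: sum(joint.get((x, y), 0) for y in ys) for x in xs}
    p_y = {y: sum(joint.get((x, y), 0) for x in xs) for y in ys}
    return all(
        joint.get((x, y), 0) == p_x[x] * p_y[y] for x, y in product(xs, ys)
    )


# Two fair coins flipped separately: each of the four outcomes has probability 1/4.
coins = {(a, b): Fraction(1, 4) for a, b in product("HT", repeat=2)}

# Hypothetical weather/traffic table in which heavy traffic is more likely in rain.
weather_traffic = {
    ("sun", "light"): Fraction(4, 10),
    ("sun", "heavy"): Fraction(1, 10),
    ("rain", "light"): Fraction(2, 10),
    ("rain", "heavy"): Fraction(3, 10),
}

print(is_independent(coins))            # True
print(is_independent(weather_traffic))  # False
```

In the second table, $P(\text{sun}, \text{light}) = 4/10$ while $P(\text{sun}) P(\text{light}) = 1/2 \times 6/10 = 3/10$, so the joint does not factorize.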