# Probability and Information Theory

Probability theory is a mathematical framework for __representing__ uncertain statements. In AI we use probability theory for two main reasons:

* It tells us how AI systems should reason, so we design our algorithms to compute or approximate various expressions derived using probability theory. For example, given an image of a landscape in which the sky looks extremely dark, it is likely that the image was taken at night. Thus, for most geographical locations, the probability of it being night given a dark sky should be high.

* We can use probability and statistics to theoretically analyze the behaviour of proposed AI systems.

Whilst probability theory allows us to make uncertain statements and reason in the presence of uncertainty, __information theory__ helps us quantify the amount of uncertainty in a probability distribution.


## 3.1 Why Probability?

Machine learning has to deal with uncertain quantities and, at times, stochastic (non-deterministic/random) quantities, so it needs probability theory as a framework for reasoning about them. In fact, nearly all propositions (other than mathematical statements that are true by definition) carry some uncertainty, arising from:

* Stochasticity of the system being modeled. For example, we can create theoretical scenarios that we postulate to have random dynamics, such as a hypothetical card game where we assume the cards are truly shuffled into a random order.

* Incomplete observability. Even deterministic systems can appear stochastic when we cannot observe all the variables that drive the system. For example, in the Monty Hall problem, a game show contestant is asked to choose from three doors. Two doors lead to goats and the third leads to a car. The outcome given the contestant's choice is completely deterministic, but from the contestant's point of view it is uncertain because they cannot look behind the doors beforehand.

* Incomplete modeling. When we model a system we discard some information about it which leads to uncertainty in our model's predictions (due to our model being incomplete). For example, suppose we build a robot that can exactly observe the location of every object around it. If the robot discretizes space when predicting the future location of these objects, then this discretization makes the robot uncertain about the precise location of these objects.
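The incomplete-observability case can be illustrated with a small simulation. This is a minimal sketch, assuming the car sits behind a fixed door (the index `car_door=2` is an arbitrary choice made here) and that a contestant with no information guesses uniformly at random. The names `play_round` and `win_rate` are hypothetical.

```python
import random

def play_round(rng, car_door=2, num_doors=3):
    """Return True if a contestant who cannot see behind the doors picks the car.

    The car's position is fixed, so the environment is deterministic.
    Only the contestant's knowledge is incomplete, which we model here
    (as an assumption) by a uniform guess over the doors.
    """
    guess = rng.randrange(num_doors)
    return guess == car_door

rng = random.Random(0)  # fixed seed so the sketch is reproducible
trials = 100_000
wins = sum(play_round(rng) for _ in range(trials))
win_rate = wins / trials
print(f"estimated P(win) = {win_rate:.3f} (exact value is 1/3)")
```

Nothing in the environment is random here. The 1/3 arises purely from what the contestant cannot observe.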

In many cases it is more practical to use a simple but uncertain rule rather than a complex but certain one. Complex rules are harder to develop and maintain, and can themselves be brittle and prone to failure, so any additional determinism might not be worth the extra effort.

There are two main interpretations of probability:

* __Frequentist probability__ sees probability as the long-run frequency of occurrence. In the frequentist view, $P(A)$ measures the proportion of repetitions in which we would observe $A$ were we to repeat an experiment infinitely many times. For example, if we were to throw a fair die an infinite number of times, we would expect to see a three in $\dfrac{1}{6}$ of the repetitions.

* __Bayesian probability__ sees probability as representing a __degree of belief__, with 1 indicating absolute certainty about the occurrence of an event and 0 indicating absolute certainty about its non-occurrence. Thus, if a doctor is $40\%$ certain that a patient has the flu, the doctor's degree of belief that the patient has the flu is 0.4.

Frequentist probability relates to the rates at which events occur, whereas Bayesian probability relates to qualitative levels of certainty. However, the same axioms govern both approaches (Ramsey 1926).
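The frequentist reading of the die example can be approximated by simulation. The following is a minimal sketch, assuming a pseudo-random generator is an acceptable stand-in for a fair die. The helper `relative_frequency` is a hypothetical name introduced here.

```python
import random

def relative_frequency(rng, num_rolls, face=3):
    """Fraction of `num_rolls` simulated fair-die throws that show `face`."""
    hits = sum(1 for _ in range(num_rolls) if rng.randint(1, 6) == face)
    return hits / num_rolls

rng = random.Random(0)  # fixed seed for reproducibility
for num_rolls in (100, 10_000, 200_000):
    # The estimate should drift towards 1/6 (about 0.1667) as num_rolls grows.
    print(num_rolls, round(relative_frequency(rng, num_rolls), 4))

long_run = relative_frequency(rng, 200_000)
```

A finite simulation can only approximate the infinite-repetition limit, so the printed values fluctuate around $\dfrac{1}{6}$ rather than equalling it.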

## 3.2 Random Variables

A __random variable__ is a variable that can take on different values randomly. A random variable is just a description of the states that are possible; it must be coupled with a probability distribution that specifies how likely each of these states is. There are two types of random variables:

* Discrete - Has a finite or countably infinite number of states.
* Continuous - Associated with a real value (so the set of possible values is uncountably infinite).


## 3.3 Probability Distribution

A __probability distribution__ is a description of how likely a random variable or a set of random variables is to take on each of its possible states. The way we describe a probability distribution depends on whether the random variables are discrete or continuous.

### 3.3.1 Discrete Variables and Probability Mass Functions

A probability distribution over _discrete random variables_ is described using a __probability mass function__ (PMF) denoted by $P$. The probability that the random variable ${\rm x}$ takes the value $x$ is denoted by $P(x)$, or $P({\rm x} = x)$ when we need to be explicit. A probability of 1 indicates that ${\rm x} = x$ is certain and a probability of 0 indicates that ${\rm x} = x$ is impossible. We use ${\rm x} \sim P({\rm x})$ to denote that ${\rm x}$ follows the distribution $P({\rm x})$.

The probability mass function can act on many random variables at the same time; such a probability mass function is known as a __joint probability distribution__. $P({\rm x} = x, {\rm y} = y)$ denotes the probability that ${\rm x} = x$ and ${\rm y} = y$ simultaneously. We use $P(x, y)$ for brevity.

To be a PMF on a random variable ${\rm x}$, a function $P$ must satisfy the following properties:

* The domain of $P$ must be the set of all possible states of ${\rm x}$.
* $\forall x \in {\rm x}, 0 \le P(x) \le 1$
* $\sum_{x \in {\rm x}} P(x) = 1$. We refer to this property as being __normalized__.

Let's say ${\rm x}$ has $k$ different states. We can place a __uniform distribution__ on ${\rm x}$ such that $\forall x \in {\rm x}, P(x) = \dfrac{1}{k}$.
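The PMF properties and the uniform distribution can be sketched in a few lines of Python. This is only an illustration, assuming a PMF is stored as a dictionary from states to probabilities. The helper names `is_pmf` and `uniform_pmf` are hypothetical.

```python
import math

def is_pmf(p, tol=1e-9):
    """Check the PMF properties for a dict mapping each state to its probability."""
    in_range = all(0.0 <= value <= 1.0 for value in p.values())
    normalized = math.isclose(sum(p.values()), 1.0, abs_tol=tol)
    return in_range and normalized

def uniform_pmf(states):
    """Uniform distribution: each of the k states gets probability 1/k."""
    k = len(states)
    return {state: 1.0 / k for state in states}

die = uniform_pmf([1, 2, 3, 4, 5, 6])
print(is_pmf(die))                   # True: values in [0, 1] and sum to 1
print(is_pmf({"a": 0.7, "b": 0.7}))  # False: the values sum to 1.4
```

The domain property is implicit in this representation: the dictionary keys are taken to be the full set of states.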

### 3.3.2 Continuous Variables and Probability Density Functions

A probability distribution over _continuous random variables_ is described using a __probability density function__ (PDF). A PDF $p$ must satisfy the following properties:

* The domain of $p$ must be the set of all possible states of ${\rm x}$.
* $\forall x \in {\rm x}, p(x) \ge 0$ __Note:__ we do not require $p(x) \le 1$.
* $\int p(x)\,dx = 1$.
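The note that $p(x) \le 1$ is not required can be checked numerically. The following is a minimal sketch, assuming a uniform density on $[0, 0.5]$ (an example chosen here for illustration) and a simple midpoint-rule integrator. The density equals 2 inside the interval, yet the total area is still 1.

```python
def uniform_pdf(x, a=0.0, b=0.5):
    """Density of the uniform distribution on [a, b]: 1 / (b - a) inside, 0 outside."""
    return 1.0 / (b - a) if a <= x <= b else 0.0

def integrate(f, lo, hi, steps=100_000):
    """Midpoint-rule approximation of the integral of f over [lo, hi]."""
    width = (hi - lo) / steps
    return sum(f(lo + (i + 0.5) * width) for i in range(steps)) * width

peak = uniform_pdf(0.25)                   # density value 2.0, larger than 1
total = integrate(uniform_pdf, -1.0, 1.0)  # area under the density, close to 1
print(peak, round(total, 6))
```

A density value is not itself a probability. Probabilities come from integrating $p$ over a set of values.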


## Exponential and Laplace Distribution

In deep learning we often want a probability distribution with a sharp point at $x = 0$. The __exponential distribution__ lets us do this. We define the distribution, using the indicator function $\boldsymbol{1}_{x \ge 0}$ to assign probability zero to all negative values of $x$, as:

$ p(x; \lambda) = \lambda \boldsymbol{1}_{x \ge 0} \exp(-\lambda x) $

A closely related probability distribution that allows us to place a sharp peak of probability mass at an arbitrary point $\mu$ is the __Laplace distribution__:

$ \text{Laplace}(x; \mu, \gamma) = \dfrac{1}{2\gamma} \exp\left(-\dfrac{|x - \mu|}{\gamma}\right) $
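Both densities translate directly into code. This is a minimal sketch, assuming plain Python floats and a midpoint-rule integrator to confirm that each density integrates to roughly 1. The parameter values ($\lambda = 2$, $\mu = 1.5$, $\gamma = 0.5$) are arbitrary choices made here.

```python
import math

def exponential_pdf(x, lam):
    """p(x; lambda) = lambda * 1[x >= 0] * exp(-lambda * x)."""
    return lam * math.exp(-lam * x) if x >= 0 else 0.0

def laplace_pdf(x, mu, gamma):
    """Laplace(x; mu, gamma) = exp(-|x - mu| / gamma) / (2 * gamma)."""
    return math.exp(-abs(x - mu) / gamma) / (2.0 * gamma)

def integrate(f, lo, hi, steps=200_000):
    """Midpoint-rule approximation of the integral of f over [lo, hi]."""
    width = (hi - lo) / steps
    return sum(f(lo + (i + 0.5) * width) for i in range(steps)) * width

# Total mass of each density over a range wide enough to hold almost all of it.
exp_mass = integrate(lambda x: exponential_pdf(x, lam=2.0), -10.0, 40.0)
lap_mass = integrate(lambda x: laplace_pdf(x, mu=1.5, gamma=0.5), -40.0, 40.0)
print(round(exp_mass, 4), round(lap_mass, 4))
```

The exponential density is largest at $x = 0$ and the Laplace density is largest at $x = \mu$, which is the sharp point described in the text.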

## The Dirac Distribution and the Empirical Distribution

In some cases we wish to specify that all the mass in a probability distribution clusters around a single point. This can be accomplished by defining a PDF using the __Dirac delta function__, $\delta(x)$:

$ p(x) = \delta(x - \mu)$

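The Dirac delta function is defined such that it is zero-valued everywhere except at 0, yet integrates to 1. Shifting it by $\mu$ therefore gives a $p(x)$ that is 0 everywhere except at $x = \mu$, where it places an infinitely narrow and infinitely high peak of probability mass.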

A common use of the Dirac delta distribution is as a component of an __empirical distribution__.
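In the standard formulation, the empirical distribution puts equal mass on each observed point. As a sketch, assuming $m$ observed points $x_{1}, \dots, x_{m}$ (the symbol $m$ is introduced here for illustration):

$ \hat{p}(x) = \dfrac{1}{m} \sum_{i=1}^{m} \delta(x - x_{i}) $

Each term is a Dirac distribution centred on one data point and carries mass $\dfrac{1}{m}$, so $\hat{p}$ integrates to 1.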

### Empirical Distribution motivation

Consider a situation where we have iid data $x_{i}$ from some unknown distribution. One problem of
interest is estimating the distribution that is generating the data. There are many useful examples
of this abstract problem, including:

* We are analyzing a telephone call center, and we measure the amount of time required to provide service to each incoming phone call. Though an exponential distribution is a commonly used model, it can be useful to estimate the distribution of the call lengths to help validate the use of an exponential distribution model. In some cases, we may find by using the data that an exponential distribution is not a good model.

* We are analyzing midterm scores within a class, and would like to gain a better understanding of how well the class scored on the exam.

Within the problem of estimating the distribution using iid $x_{i}$, there are two distinct problems:

1. Estimate the cdf $F_x(u)$.
2. Estimate the pdf $f_x(u)$.

We will consider each of these problems. We will drop the $x$ subscript to simplify the notation.

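Almost sure convergence of an estimate $\hat{F}(u)$ to $F(u)$ means that with probability 1 the discrepancy between $\hat{F}(u)$ and $F(u)$ goes to 0 as the number of samples grows.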
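One common estimate of the cdf is the empirical cdf, the fraction of samples that are at most $u$. The following is a minimal sketch for the call-centre example, assuming call lengths really are exponentially distributed with a hypothetical rate of 0.5 calls per minute. The names `empirical_cdf`, `true_cdf` and `max_gap` are introduced here for illustration.

```python
import bisect
import math
import random

def empirical_cdf(samples):
    """Return F_hat, where F_hat(u) is the fraction of samples that are <= u."""
    ordered = sorted(samples)
    n = len(ordered)

    def f_hat(u):
        # bisect_right returns how many of the sorted samples are <= u.
        return bisect.bisect_right(ordered, u) / n

    return f_hat

rate = 0.5  # hypothetical service rate (assumption, not from the text)
rng = random.Random(0)
call_lengths = [rng.expovariate(rate) for _ in range(20_000)]
f_hat = empirical_cdf(call_lengths)

def true_cdf(u):
    """cdf of the exponential distribution used to generate the samples."""
    return 1.0 - math.exp(-rate * u) if u >= 0 else 0.0

# Largest discrepancy between the estimate and the true cdf on a few test points.
max_gap = max(abs(f_hat(u) - true_cdf(u)) for u in (0.5, 1, 2, 4, 8))
print(f"largest gap on the test points: {max_gap:.4f}")
```

With real data the true cdf is unknown, so a comparison like this is only possible against a candidate model such as the exponential.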


## Things to remember about sigmoid and softplus functions

* $\sigma(x) = \dfrac{\exp(x)}{\exp(x) + \exp(0)}$
* $\dfrac{d\sigma(x)}{dx} = \sigma(x)(1 - \sigma(x))$
* $1 - \sigma(x) = \sigma(-x)$
* $\log \sigma(x) = -\varsigma(-x)$
* $\dfrac{d\varsigma(x)}{dx} = \sigma(x)$ 
* $\forall x \in (0, 1), \sigma^{-1}(x) = \log\left(\dfrac{x}{1 - x}\right)$
* $\forall x > 0, \varsigma^{-1}(x) = \log(\exp(x) - 1)$
* $\varsigma(x) = \int_{-\infty}^{x}\sigma(y) dy$
* $\varsigma(x) - \varsigma(-x) = x$
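These identities can be spot-checked numerically. This is a minimal sketch, assuming $\varsigma$ denotes the softplus function $\log(1 + \exp(x))$ and using a central finite difference as a stand-in for the derivatives. The helper names are hypothetical, and the naive formulas overflow for large $|x|$, so this is not a numerically robust implementation.

```python
import math

def sigmoid(x):
    """Logistic sigmoid in the exp(x) / (exp(x) + exp(0)) form."""
    return math.exp(x) / (math.exp(x) + math.exp(0))

def softplus(x):
    """Softplus, log(1 + exp(x))."""
    return math.log1p(math.exp(x))

def derivative(f, x, h=1e-5):
    """Central finite-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)

def identity_errors(x):
    """Absolute error of each identity at the point x."""
    s = sigmoid(x)
    return {
        "d sigma/dx = sigma (1 - sigma)": abs(derivative(sigmoid, x) - s * (1.0 - s)),
        "1 - sigma(x) = sigma(-x)": abs((1.0 - s) - sigmoid(-x)),
        "log sigma(x) = -softplus(-x)": abs(math.log(s) + softplus(-x)),
        "d softplus/dx = sigma": abs(derivative(softplus, x) - s),
        "sigma^-1(sigma(x)) = x": abs(math.log(s / (1.0 - s)) - x),
        "softplus^-1(softplus(x)) = x": abs(math.log(math.expm1(softplus(x))) - x),
        "softplus(x) - softplus(-x) = x": abs(softplus(x) - softplus(-x) - x),
    }

points = (-3.0, -0.5, 0.0, 1.2, 4.0)
worst = max(err for x in points for err in identity_errors(x).values())
print(f"largest deviation over all identities: {worst:.2e}")
```

The integral identity is covered indirectly: it is equivalent to the derivative of softplus being the sigmoid, together with softplus tending to 0 as $x \to -\infty$.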
