# Probability and Information Theory


___Scraped from MIT deep learning book___ by Ian Goodfellow, Yoshua Bengio & Aaron Courville

Probability theory is a mathematical framework for representing uncertain
statements. It provides a means of quantifying uncertainty and axioms for deriving
new uncertain statements. In artificial intelligence applications, we use probability
theory in two major ways. First, the laws of probability tell us how AI systems
should reason, so we design our algorithms to compute or approximate various
expressions derived using probability theory. Second, we can use probability and
statistics to theoretically analyze the behavior of proposed AI systems. 

Many branches of computer science deal mostly with entities that are entirely
deterministic and certain. A programmer can usually safely assume that a CPU will
execute each machine instruction flawlessly. Errors in hardware do occur, but are
rare enough that most software applications do not need to be designed to account
for them. Given that many computer scientists and software engineers work in a
relatively clean and certain environment, it can be surprising that machine learning
makes heavy use of probability theory.

There are three possible sources of uncertainty:
1. Inherent stochasticity in the system being modeled. For example, most interpretations of quantum mechanics describe the dynamics of subatomic particles as being probabilistic. We can also create theoretical scenarios that we postulate to have random dynamics, such as a hypothetical card game where we assume that the cards are truly shuffled into a random order.
2. Incomplete observability. Even deterministic systems can appear stochastic when we cannot observe all of the variables that drive the behavior of the system. For example, in the Monty Hall problem, a game show contestant is asked to choose between three doors and wins a prize held behind the chosen door. Two doors lead to a goat while a third leads to a car. The outcome given the contestant’s choice is deterministic, but from the contestant’s point of view, the outcome is uncertain.
3. Incomplete modeling. When we use a model that must discard some of the information we have observed, the discarded information results in uncertainty in the model’s predictions. For example, suppose we build a robot that can exactly observe the location of every object around it. If the robot discretizes space when predicting the future location of these objects, then the discretization makes the robot immediately become uncertain about the precise position of objects: each object could be anywhere within the discrete cell that it was observed to occupy.

`"In many cases, it is more practical to use a simple but uncertain rule rather
than a complex but certain one, even if the true rule is deterministic and our
modeling system has the fidelity to accommodate a complex rule. For example, the
simple rule “Most birds fly” is cheap to develop and is broadly useful, while a rule
of the form, “Birds fly, except for very young birds that have not yet learned to
fly, sick or injured birds that have lost the ability to fly, flightless species of birds
including the cassowary, ostrich and kiwi. . .” is expensive to develop, maintain and
communicate, and after all of this effort is still very brittle and prone to failure."`

- `degree of belief`: in a doctor's diagnosis example, we use probability to represent a degree of belief, with 1 indicating absolute certainty that the patient has the flu and 0 indicating absolute certainty that the patient does not have the flu.
- `frequentist probability`: the kind of probability related directly to the rates at which events occur.
- `Bayesian probability`: the kind of probability related to qualitative levels of uncertainty.


Probability can be seen as the extension of logic to deal with uncertainty. Logic
provides a set of formal rules for determining what propositions are implied to
be true or false given the assumption that some other set of propositions is true
or false. Probability theory provides a set of formal rules for determining the
likelihood of a proposition being true given the likelihood of other propositions.


## Random Variables

A random variable is a variable that can take on different values randomly (more formally, it is a function that maps outcomes of a random process to values). On
its own, a random variable is just a description of the states that are possible; it
must be coupled with a probability distribution that specifies how likely each of
these states is.

Random variables may be discrete or continuous. A discrete random variable
is one that has a finite or countably infinite number of states. A continuous random variable is
associated with a real value.


## Probability Distributions

A `probability distribution` is a description of how likely a random variable or
set of random variables is to take on each of its possible states. The way we
describe probability distributions depends on whether the variables are discrete or
continuous.


### Discrete Variables and Probability Mass Functions

A probability distribution over discrete variables may be described using a ___probability mass function (PMF)___.

The probability mass function maps from a state of a random variable to
the probability of that random variable taking on that state. The probability
that $\mathrm{x} = x$ (the random variable $\mathrm{x}$ takes the value $x$) is denoted as $P(x)$, with a probability of 1 indicating that $\mathrm{x} = x$ is
certain and a probability of 0 indicating that $\mathrm{x} = x$ is impossible.

Probability mass functions can act on many variables at the same time. Such
a probability distribution over many variables is known as a ___joint probability
distribution___.

To be a probability mass function on a random variable x, a function P must
satisfy the following properties:
- The domain of P must be the set of all possible states of x.
- $\forall x \in \mathrm{x},\ 0 \le P(x) \le 1$. An impossible event has probability 0 and no state can be less probable than that. Likewise, an event that is guaranteed to happen has probability 1, and no state can have a greater chance of occurring.
- $\sum_{x \in \mathrm{x}} P(x) = 1$. We refer to this property as being normalized. Without this property, we could obtain probabilities greater than one by computing the probability of one of many events occurring.
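
The two numeric properties above can be checked mechanically. The following is a minimal sketch, assuming the PMF is stored as a Python dict from state to probability; the helper name `is_valid_pmf` and both example tables are hypothetical illustrations, not part of the source text.

```python
import math

def is_valid_pmf(pmf, tol=1e-9):
    """Return True if `pmf` (dict: state -> probability) has every value
    in [0, 1] and the values sum to 1 within a small tolerance."""
    in_range = all(0.0 <= p <= 1.0 for p in pmf.values())
    normalized = math.isclose(sum(pmf.values()), 1.0, abs_tol=tol)
    return in_range and normalized

# Hypothetical example: a fair six-sided die.
fair_die = {face: 1 / 6 for face in range(1, 7)}

# Each value is in [0, 1], but the table sums to 1.2, so it is not normalized.
not_normalized = {"a": 0.6, "b": 0.6}

print(is_valid_pmf(fair_die))        # True
print(is_valid_pmf(not_normalized))  # False
```

The tolerance is there only because floating-point sums such as six copies of `1 / 6` may not equal 1 exactly.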

### Continuous Variables and Probability Density Functions


When working with continuous random variables, we describe probability distributions using a ___probability density function (PDF)___ rather than a probability
mass function.

To be a probability density function, a function p must satisfy the
following properties:
- The domain of p must be the set of all possible states of x. 
- $\forall x \in \mathrm{x},\ p(x) \ge 0$. Note that we do not require $p(x) \le 1$.
- $\int p(x)\,dx = 1$.

A probability density function $p(x)$ does not give the probability of a specific
state directly; instead, the probability of landing inside an infinitesimal region with
volume $\delta x$ is given by $p(x)\,\delta x$. We can integrate the density function to find the actual probability mass of a
set of points. Specifically, the probability that $\mathrm{x}$ lies in some set $S$ is given by the
integral of $p(x)$ over that set.
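
To make the difference between density and probability concrete, here is a small sketch using a uniform density on the interval [0, 0.5]. The interval, the function names, and the midpoint-rule integrator are all assumptions chosen for illustration. The density is 2 everywhere on the interval, which is greater than 1, yet it still integrates to 1.

```python
def uniform_pdf(x, a=0.0, b=0.5):
    """Density of a uniform distribution on [a, b]; zero elsewhere."""
    return 1.0 / (b - a) if a <= x <= b else 0.0

def integrate(f, lo, hi, n=100_000):
    """Midpoint-rule approximation of the integral of f over [lo, hi]."""
    width = (hi - lo) / n
    return sum(f(lo + (i + 0.5) * width) for i in range(n)) * width

# A density value may exceed 1; it is not itself a probability.
density_at_point = uniform_pdf(0.25)

# Integrating over all possible states gives (approximately) 1.
total_mass = integrate(uniform_pdf, 0.0, 0.5)

# Probability that x lies in the set S = [0.1, 0.2]: integrate p over S.
prob_in_set = integrate(uniform_pdf, 0.1, 0.2)

print(density_at_point, total_mass, prob_in_set)
```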

![](img/pdfpmf.PNG)

## Marginal Probability

Sometimes we know the probability distribution over a set of variables and we want
to know the probability distribution over just a subset of them. The probability
distribution over the subset is known as the marginal probability distribution.

For example, suppose we have discrete random variables x and y, and we know
P(x, y). We can find P(x) with the sum rule:

$\forall x \in \mathrm{x},\quad P(\mathrm{x} = x) = \sum_{y} P(\mathrm{x} = x, \mathrm{y} = y).$

For continuous variables, we need to use integration instead of summation:

$p(x) = \int p(x, y)\,dy.$
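
For the discrete case, the sum rule amounts to adding up the rows of a joint table. The sketch below assumes a made-up joint distribution over weather and activity; the table values and the name `marginal_x` are hypothetical. Exact fractions are used so the result carries no rounding error.

```python
from collections import defaultdict
from fractions import Fraction

# Hypothetical joint distribution P(x, y): x is weather, y is activity.
joint = {
    ("sunny", "walk"): Fraction(3, 10),
    ("sunny", "read"): Fraction(1, 10),
    ("rainy", "walk"): Fraction(1, 10),
    ("rainy", "read"): Fraction(5, 10),
}

def marginal_x(joint):
    """Sum rule: P(x) is the sum over y of P(x, y)."""
    result = defaultdict(Fraction)  # missing keys start at Fraction(0)
    for (x, _y), p in joint.items():
        result[x] += p
    return dict(result)

p_x = marginal_x(joint)
print(p_x["sunny"], p_x["rainy"])  # 2/5 3/5
```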

## Conditional Probability

![](img/condition.PNG)
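
As a rough illustration, assuming the figure above states the usual definition $P(y \mid x) = P(y, x) / P(x)$, which is only defined when $P(x) > 0$, a conditional distribution can be read off a joint table as follows. The joint table and the function name are hypothetical.

```python
from fractions import Fraction

# Hypothetical joint distribution P(x, y): x is weather, y is activity.
joint = {
    ("sunny", "walk"): Fraction(3, 10),
    ("sunny", "read"): Fraction(1, 10),
    ("rainy", "walk"): Fraction(1, 10),
    ("rainy", "read"): Fraction(5, 10),
}

def conditional_y_given_x(joint, x_value):
    """Return P(y | x = x_value) as a dict, using P(y | x) = P(x, y) / P(x)."""
    p_x = sum(p for (x, _y), p in joint.items() if x == x_value)
    if p_x == 0:
        # Conditioning on an event that never happens is undefined.
        raise ValueError("cannot condition on an event with probability zero")
    return {y: p / p_x for (x, y), p in joint.items() if x == x_value}

given_rainy = conditional_y_given_x(joint, "rainy")
print(given_rainy["walk"], given_rainy["read"])  # 1/6 5/6
```

Dividing by $P(x)$ renormalizes the selected row so that the conditional probabilities sum to 1.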


## The Chain Rule of Conditional Probabilities

Any joint probability distribution over many random variables may be decomposed
into conditional distributions over only one variable:
![](img/chainprob.PNG)

This observation is known as the chain rule or product rule of probability. It follows immediately from the definition of conditional probability (equation 3.5 in the book).
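
As a small worked instance, applying the definition of conditional probability twice to three variables (the names $a$, $b$, $c$ are illustrative) gives:

$$
\begin{aligned}
P(a, b, c) &= P(a \mid b, c)\, P(b, c) \\
P(b, c) &= P(b \mid c)\, P(c) \\
P(a, b, c) &= P(a \mid b, c)\, P(b \mid c)\, P(c)
\end{aligned}
$$

Each step peels one variable off the joint distribution, so every factor on the last line is a distribution over a single variable.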



## Independence and Conditional Independence

![](img/indep.PNG)

## Expectation, Variance and Covariance

The expectation or expected value of some function f(x) with respect to a
probability distribution P(x) is the average or mean value that f takes on when x
is drawn from P.

![](img/expect.PNG)
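
For a discrete variable, and assuming the figure above shows the usual sum $\mathbb{E}[f(x)] = \sum_x P(x) f(x)$, the expectation is a probability-weighted average. The sketch below uses a fair die as a hypothetical example; the helper name `expectation` is also an assumption.

```python
from fractions import Fraction

# Hypothetical PMF: a fair six-sided die.
fair_die = {face: Fraction(1, 6) for face in range(1, 7)}

def expectation(pmf, f=lambda x: x):
    """Expected value of f(x) when x is drawn from `pmf`:
    the sum over states of P(x) * f(x)."""
    return sum(p * f(x) for x, p in pmf.items())

mean_face = expectation(fair_die)                      # f is the identity
mean_square = expectation(fair_die, lambda x: x * x)   # f(x) = x^2

print(mean_face, mean_square)  # 7/2 91/6
```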

![](img/varcovar.PNG)

Other measures, such as correlation, normalize the
contribution of each variable in order to measure only how much the variables are
related, rather than also being affected by the scale of the separate variables.
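
The effect of scale can be seen by changing units. In this sketch the data are invented, and the helpers compute the population covariance and the Pearson correlation by hand. Converting heights from meters to centimeters multiplies the covariance by 100 but leaves the correlation unchanged.

```python
import math

def mean(values):
    return sum(values) / len(values)

def covariance(xs, ys):
    """Population covariance: the average of (x - mean_x) * (y - mean_y)."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)

def correlation(xs, ys):
    """Covariance divided by the product of the two standard deviations."""
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))

# Hypothetical measurements of the same four people.
heights_m = [1.5, 1.6, 1.7, 1.8]
weights_kg = [50.0, 58.0, 63.0, 75.0]
heights_cm = [100 * h for h in heights_m]

print(covariance(heights_m, weights_kg), covariance(heights_cm, weights_kg))
print(correlation(heights_m, weights_kg), correlation(heights_cm, weights_kg))
```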

![](img/covar.PNG)

## `Common Probability Distributions`

### Bernoulli Distribution

The Bernoulli distribution is a distribution over a single binary random variable.
It is controlled by a single parameter φ ∈ [0, 1], which gives the probability of the
random variable being equal to 1.

![](img/berndist.PNG)
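
A minimal sketch of the Bernoulli distribution, assuming the figure above gives the standard form $P(\mathrm{x} = x) = \phi^x (1 - \phi)^{1 - x}$ with mean $\phi$. The function names, the seed, and the value of `phi` are illustrative choices.

```python
import random

def bernoulli_pmf(x, phi):
    """P(x = x) = phi**x * (1 - phi)**(1 - x) for x in {0, 1}."""
    if x not in (0, 1):
        raise ValueError("a Bernoulli variable takes only the values 0 and 1")
    return phi**x * (1 - phi) ** (1 - x)

def sample_bernoulli(phi, rng):
    """Draw 1 with probability phi, otherwise 0."""
    return 1 if rng.random() < phi else 0

rng = random.Random(0)  # fixed seed so the run is repeatable
phi = 0.3
samples = [sample_bernoulli(phi, rng) for _ in range(100_000)]

# The fraction of ones should be close to phi, the expected value.
empirical_mean = sum(samples) / len(samples)
print(bernoulli_pmf(1, phi), bernoulli_pmf(0, phi), empirical_mean)
```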


### Multinoulli Distribution

The multinoulli or categorical distribution is a distribution over a single discrete
variable with k different states, where k is finite.

The multinoulli distribution is parametrized by a vector $p \in [0, 1]^{k-1}$, where $p_i$ gives the probability of the $i$-th
state. The final, $k$-th state’s probability is given by $1 - \mathbf{1}^\top p$. Note that we must
constrain $\mathbf{1}^\top p \le 1$. Multinoulli distributions are often used to refer to distributions
over categories of objects, so we do not usually assume that state 1 has numerical
value 1, etc. For this reason, we do not usually need to compute the expectation
or variance of multinoulli-distributed random variables.
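
The parametrization with $k - 1$ numbers can be sketched as follows: the last state's probability is whatever mass remains, and the constraint $\mathbf{1}^\top p \le 1$ keeps it from going negative. The function name and the example vectors are hypothetical.

```python
def full_probabilities(p, tol=1e-12):
    """Expand the k-1 parameters `p` into the probabilities of all k states.
    The k-th probability is 1 - sum(p), i.e. 1 - 1^T p."""
    if any(not 0.0 <= p_i <= 1.0 for p_i in p):
        raise ValueError("each parameter must lie in [0, 1]")
    total = sum(p)  # this is 1^T p
    if total > 1.0 + tol:
        raise ValueError("parameters must sum to at most 1")
    return list(p) + [max(0.0, 1.0 - total)]

# Three states described by two parameters; the third gets the remainder.
probs = full_probabilities([0.2, 0.5])
print(probs)
```

Storing only $k - 1$ values avoids redundancy, since the last probability is fully determined by the others.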

### `Gaussian Distribution`

The most commonly used distribution over real numbers is the normal distribution, also known as the Gaussian distribution:
![](img/gauss.PNG)
![](img/gauss2.PNG)
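
For reference, here is a sketch of the density, assuming the figures above show the standard form $\mathcal{N}(x; \mu, \sigma^2) = \sqrt{\frac{1}{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$. The function names and the parameter values are illustrative. With a small standard deviation the peak density exceeds 1, while the total mass stays at 1.

```python
import math

def gaussian_pdf(x, mu=0.0, sigma=1.0):
    """Density of N(x; mu, sigma^2), where sigma > 0 is the standard deviation."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    coeff = 1.0 / math.sqrt(2.0 * math.pi * sigma**2)
    return coeff * math.exp(-((x - mu) ** 2) / (2.0 * sigma**2))

def integrate(f, lo, hi, n=20_000):
    """Midpoint-rule approximation of the integral of f over [lo, hi]."""
    width = (hi - lo) / n
    return sum(f(lo + (i + 0.5) * width) for i in range(n)) * width

# Density at the mean for mu = 1, sigma = 0.1: roughly 3.99, well above 1.
peak = gaussian_pdf(1.0, mu=1.0, sigma=0.1)

# Integrating over mu +/- 10 sigma captures essentially all of the mass.
mass = integrate(lambda x: gaussian_pdf(x, mu=1.0, sigma=0.1), 0.0, 2.0)
print(peak, mass)
```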


The normal distribution is a good default choice for two major reasons:

- First, many distributions we wish to model are truly close to being normal distributions. The central limit theorem shows that the sum of many independent random variables is approximately normally distributed. This means that in practice, many complicated systems can be modeled successfully as normally distributed noise, even if the system can be decomposed into parts with more structured behavior.
- Second, out of all possible probability distributions with the same variance, the normal distribution encodes the maximum amount of uncertainty over the real numbers.
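
The first reason can be illustrated with a small simulation. This is a sketch under assumed settings (12 uniform terms per sum, 20,000 sums, a fixed seed), not a proof. Each sum of independent Uniform(0, 1) draws is standardized, and the fraction falling within one standard deviation of the mean is compared with the roughly 68% expected of a normal distribution.

```python
import random
import statistics

rng = random.Random(42)  # fixed seed so the run is repeatable

def standardized_uniform_sum(rng, n_terms=12):
    """Sum n_terms independent Uniform(0, 1) draws, then subtract the mean
    (n_terms / 2) and divide by the standard deviation sqrt(n_terms / 12)."""
    total = sum(rng.random() for _ in range(n_terms))
    return (total - n_terms / 2) / (n_terms / 12) ** 0.5

draws = [standardized_uniform_sum(rng) for _ in range(20_000)]

sample_mean = statistics.fmean(draws)   # should be near 0
sample_std = statistics.pstdev(draws)   # should be near 1
within_one_std = sum(abs(z) < 1 for z in draws) / len(draws)

print(sample_mean, sample_std, within_one_std)
```

No single uniform draw looks bell-shaped, yet the standardized sums behave much like a standard normal variable.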

![](img/gauss3.PNG)
![](img/gauss4.PNG)

### Exponential and Laplace Distributions

![](img/laplace.PNG)
