# Statistics

## Bayesian inference

**Bayesian inference** is a method of statistical inference in which Bayes' theorem is used to update the probability for a hypothesis as more evidence or information becomes available.

**Statistical inference** is the process of deducing properties of an underlying distribution by analysis of data. Inferential statistical analysis infers properties about a population: this includes testing hypotheses and deriving estimates. The population is assumed to be larger than the observed data set; in other words, the observed data is assumed to be sampled from a larger population.

**Descriptive statistics** is solely concerned with properties of the observed data, and does not assume that the data came from a larger population.

## Prior probability in Bayesian inference

**Prior probability distribution = the prior** of an uncertain quantity is the probability distribution that would express one's beliefs about this quantity before some evidence is taken into account.

**Prior probability** of a random event or an uncertain proposition is the unconditional probability that is assigned before any relevant evidence is taken into account.

### Informative prior
expresses specific, definite information about a variable. An example is a prior distribution for the temperature at noon tomorrow. A reasonable approach is to make the prior a normal distribution with expected value equal to today's noontime temperature, with variance equal to the day-to-day variance of atmospheric temperature, or a distribution of the temperature for that day of the year.

This example has a property in common with many priors: the posterior from one problem (today's temperature) becomes the prior for another problem (tomorrow's temperature). Pre-existing evidence that has already been taken into account is part of the prior. As more evidence accumulates, the posterior is determined largely by the evidence rather than by any original assumption, provided that the original assumption admitted the possibility of what the evidence suggests.
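The way a posterior becomes the next prior can be sketched with the standard normal-normal update. This is only an illustration: it assumes each measurement has a known noise variance `obs_var`, and the function name `normal_update` and all numbers are hypothetical, not taken from the source.

```python
def normal_update(prior_mean, prior_var, obs, obs_var):
    """Posterior (mean, variance) for a normal prior and one normal
    observation whose noise variance obs_var is assumed known."""
    prior_precision = 1.0 / prior_var
    obs_precision = 1.0 / obs_var
    post_var = 1.0 / (prior_precision + obs_precision)
    post_mean = post_var * (prior_precision * prior_mean + obs_precision * obs)
    return post_mean, post_var


# Hypothetical prior: mean = today's noon temperature, variance = day-to-day variance.
mean, var = 20.0, 9.0

# Hypothetical noisy readings; each posterior is reused as the next prior.
for reading in [24.0, 23.5, 24.5, 24.0]:
    mean, var = normal_update(mean, var, reading, obs_var=4.0)

print(mean, var)
```

After four readings the mean has moved most of the way from the prior value towards the readings, and the variance has shrunk, which is the sense in which the evidence comes to dominate the original assumption.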

### Uninformative prior = diffuse prior = not very informative prior = objective prior
expresses vague or general information about a variable, e.g. that the variable is positive.

The simplest rule for determining a non-informative prior is the **principle of indifference**, which assigns equal probabilities to all possibilities (= uniform distribution).

***A priori* probability** = probability that is derived purely by deductive reasoning = from general knowledge about the data distribution before making an inference. Example: If there are N mutually exclusive and exhaustive events and if they are equally likely, then the probability of a given event occurring is 1/N.
E.g. each face of a fair die appears with equal probability = $\frac{1}{6}$.
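As a small sketch of the principle of indifference, the following assigns probability 1/N to each of N mutually exclusive and exhaustive outcomes. The helper name `indifference_prior` is hypothetical; exact fractions are used so that the probabilities sum to exactly 1.

```python
from fractions import Fraction


def indifference_prior(outcomes):
    """Assign equal probability 1/N to each of the N given outcomes."""
    outcomes = list(outcomes)
    p = Fraction(1, len(outcomes))
    return {outcome: p for outcome in outcomes}


# A six-sided die: faces 1..6, each with probability 1/6.
die_prior = indifference_prior(range(1, 7))
print(die_prior[3])
```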



https://en.wikipedia.org/wiki/Prior_probability

## Bayes' theorem

The prior probability represents our degree of belief
before the data arrives. After we observe the data, we can use Bayes' theorem
to convert this prior probability into a posterior probability. 

We want to classify a new character from an image into one of $n$ discrete classes $C_k$.

##### 1) Before seeing the image that we want to classify.

$P(C_k)$ = *prior* probability of an image belonging to class $C_k$. E.g. if the letter 'a' occurs three times as often as the letter 'b', we have $P(C_1) = 0.75$ and $P(C_2) = 0.25$ (assuming these are the only two possible letters = classes).
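One simple way to obtain such priors is to use relative class frequencies in labelled data. The sketch below assumes a hypothetical list of labels in which 'a' occurs three times as often as 'b'; the function name `class_priors` is illustrative.

```python
from collections import Counter


def class_priors(labels):
    """Estimate P(C_k) as the relative frequency of each class label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


# Hypothetical labels: 'a' appears three times as often as 'b'.
labels = ["a", "a", "a", "b"]
priors = class_priors(labels)
print(priors)
```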

##### 2) We have measured the value of the feature variable $x$ for the image. 

$x$ is assigned to one of a discrete set of values {X}. 

$P(C_k,\ X)$ = the joint probability is defined to be the probability that the image has the feature value $X$ and belongs to class $C_k$

$$
P(C_k,\ X) = P(C_k\lvert X) * P(X) 
$$

$$
P(C_k,\ X) = P(X \lvert C_k) * P(C_k)  
$$

$$
\text{posterior probability = } P(C_k\lvert X) = \frac{\text{class conditional probability } P(X \lvert C_k) * \text{ prior probability }P(C_k) }{\text{normalization factor } P(X)}
$$
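The two factorizations of the joint probability, and the resulting posterior, can be checked numerically on a small table of counts. The counts below are hypothetical (100 images, two classes 'a' and 'b', a binary feature), chosen so that the priors are 0.75 and 0.25 as in the example above.

```python
import math

# Hypothetical counts of (class, feature value) pairs in 100 labelled images.
counts = {("a", 0): 30, ("a", 1): 45, ("b", 0): 20, ("b", 1): 5}
total = sum(counts.values())


def class_count(c):
    return sum(n for (ci, _), n in counts.items() if ci == c)


def feature_count(x):
    return sum(n for (_, xi), n in counts.items() if xi == x)


def p_joint(c, x):                 # P(C_k, X)
    return counts[(c, x)] / total


def p_class(c):                    # prior P(C_k)
    return class_count(c) / total


def p_feature(x):                  # normalization factor P(X)
    return feature_count(x) / total


def p_feature_given_class(x, c):   # class conditional P(X | C_k)
    return counts[(c, x)] / class_count(c)


def p_class_given_feature(c, x):   # posterior P(C_k | X)
    return counts[(c, x)] / feature_count(x)


# Both factorizations give the same joint probability.
via_feature = p_class_given_feature("a", 1) * p_feature(1)
via_class = p_feature_given_class(1, "a") * p_class("a")
assert math.isclose(via_feature, p_joint("a", 1))
assert math.isclose(via_class, p_joint("a", 1))

# Bayes' theorem recovers the posterior from likelihood, prior and P(X).
bayes_posterior = p_feature_given_class(1, "a") * p_class("a") / p_feature(1)
assert math.isclose(bayes_posterior, p_class_given_feature("a", 1))
```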

Assume that only two classes exist: $C_1$ and $C_2$, then:

$$
P(C_1\lvert X) + P(C_2\lvert X) = 1
$$

$$
\frac{ P(X \lvert C_1) * P(C_1) }{P(X)} + \frac{ P(X \lvert C_2) * P(C_2) }{P(X)} = 1
$$

$$
P(X \lvert C_1) * P(C_1) + P(X \lvert C_2) * P(C_2) = P(X)
$$

The resulting posterior probability of class $C_k$ given $X$, when $n$ classes exist, can be calculated as:

$$
P(C_k\lvert X) = \frac{ P(X \lvert C_k) * P(C_k) }{P(X \lvert C_1) * P(C_1) + P(X \lvert C_2) * P(C_2) + \cdots} = \frac{ P(X \lvert C_k) * P(C_k) } {\sum_{i=1}^n P(X \lvert C_i) * P(C_i) }
$$
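A minimal sketch of this formula for any number of classes follows. The function name `posterior` and the likelihood values are hypothetical; the denominator is computed as the sum over all classes, so $P(X)$ never has to be supplied separately.

```python
def posterior(likelihoods, priors):
    """Return P(C_k | X) for every class.

    likelihoods[c] is P(X | C_c) for the observed X, priors[c] is P(C_c).
    """
    unnormalized = {c: likelihoods[c] * priors[c] for c in priors}
    evidence = sum(unnormalized.values())  # P(X) = sum_i P(X | C_i) * P(C_i)
    if evidence == 0:
        raise ValueError("P(X) is zero: the observed X is impossible under every class")
    return {c: value / evidence for c, value in unnormalized.items()}


priors = {"a": 0.75, "b": 0.25}
likelihoods = {"a": 0.6, "b": 0.2}  # hypothetical P(X | C_k) for the observed X
post = posterior(likelihoods, priors)
print(post)
```

With these numbers $P(X) = 0.6 \cdot 0.75 + 0.2 \cdot 0.25 = 0.5$, so the posteriors are approximately 0.9 for 'a' and 0.1 for 'b'.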

##### 3) Choose class $C_k$ with highest posterior probability.
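Because $P(X)$ is the same for every class, the class with the highest posterior is also the class with the highest product $P(X \lvert C_k) * P(C_k)$. The sketch below uses that shortcut; the function name `classify` and the numbers are hypothetical.

```python
def classify(likelihoods, priors):
    """Pick the class maximizing P(X | C_k) * P(C_k).

    Dividing by P(X) would not change the winner, so it is skipped.
    """
    return max(priors, key=lambda c: likelihoods[c] * priors[c])


# Hypothetical case where the feature alone favours 'b' (0.6 > 0.3),
# but the strong prior for 'a' outweighs it: 0.3 * 0.75 > 0.6 * 0.25.
likelihoods = {"a": 0.3, "b": 0.6}
print(classify(likelihoods, {"a": 0.75, "b": 0.25}))
print(classify(likelihoods, {"a": 0.5, "b": 0.5}))
```

The first call selects 'a' and the second selects 'b', showing how the prior can change the decision for the same observed feature value.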

Source:

http://cs.du.edu/~mitchell/mario_books/Neural_Networks_for_Pattern_Recognition_-_Christopher_Bishop.pdf
Ctrl+F "joint probability"