## Entropy
The __entropy__ quantifies the amount of uncertainty involved in the value of a random variable or the outcome of a random process. Specifically, the Shannon entropy $H$, in units of bits (per symbol), is given by 
$$H=-\sum _{i}p_{i}\log _{2}(p_{i})$$
where $p_i$ is the probability of occurrence of the $i$-th possible value of the source symbol. This equation gives the entropy in units of "bits" because it uses a logarithm of base 2, and this base-2 measure of entropy has sometimes been called the __"shannon"__ in honor of Claude Shannon. Entropy is also commonly computed using the __natural logarithm__ (base $e$, where $e$ is Euler's number), which produces a measurement of entropy in __"nats"__ and sometimes simplifies the analysis by avoiding the need to include extra constants in the formulas. Other bases are also possible, but less commonly used.
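
As a minimal sketch of the formula above, the following Python snippet computes the entropy of a discrete distribution in either bits or nats. The function name `shannon_entropy` and its `base` parameter are illustrative choices, not part of any particular library.

```python
import math

def shannon_entropy(probs, base=2.0):
    """Entropy of a discrete distribution given as a list of probabilities.

    Terms with p == 0 are skipped, following the convention 0 * log(0) = 0.
    """
    return -sum(p * math.log(p, base) for p in probs if p > 0)

fair_coin = [0.5, 0.5]
h_bits = shannon_entropy(fair_coin)          # base 2 -> bits (shannons)
h_nats = shannon_entropy(fair_coin, math.e)  # base e -> nats

# Changing the base only rescales the result: H_bits = H_nats / ln(2).
print(h_bits, h_nats, h_nats / math.log(2))
```

The two results differ only by the constant factor $\ln 2$, which is the "extra constant" that working in nats avoids.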


### Entropy of an information source
The entropy of a source that emits a sequence of $N$ symbols that are independent and identically distributed (iid) is $N ⋅ H$ bits (per message of N symbols). If the source data symbols are identically distributed but not independent, the entropy of a message of length N will be less than $N ⋅ H$.

For example, if one transmits 1000 bits (0s and 1s), and the value of each of these bits is known to the receiver (has a specific value with certainty) ahead of transmission, it is clear that no information is transmitted. If, however, each bit is independently equally likely to be 0 or 1, 1000 shannons of information (more often called bits) have been transmitted. Between these two extremes, information can be quantified as follows. If $𝕏$ is the set of all messages $\{x_1, ..., x_n\}$ that $X$ could be, and $p(x)$ is the probability of some $x \in \mathbb {X}$ , then the entropy H of X is defined as:
$$H(X)=\mathbb {E} _{X}[I(x)]=-\sum _{x\in \mathbb {X} }p(x)\log p(x)$$
Here, $I(x) = -\log p(x)$ is the self-information, which is the entropy contribution of an individual message, and $𝔼_X$ is the expected value. Entropy is maximized when all $n$ messages in the message space are equiprobable, $p(x) = 1/n$ (the most unpredictable case), in which case $H(X) = \log n$.
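
To illustrate entropy as an expected self-information, the sketch below compares three hypothetical message distributions over four messages: uniform, skewed, and fully certain. The message labels and probabilities are made up for the example, and base 2 is assumed.

```python
import math

def self_information(p):
    """Self-information -log2 p(x) of a message with probability p, in bits."""
    return -math.log2(p)

def entropy(dist):
    """Expected self-information of a distribution {message: probability}."""
    return sum(p * self_information(p) for p in dist.values() if p > 0)

uniform = {m: 0.25 for m in "abcd"}                       # equiprobable
skewed = {"a": 0.7, "b": 0.1, "c": 0.1, "d": 0.1}         # partly predictable
certain = {"a": 1.0, "b": 0.0, "c": 0.0, "d": 0.0}        # known in advance

for name, dist in [("uniform", uniform), ("skewed", skewed), ("certain", certain)]:
    print(name, entropy(dist))
```

The uniform case reaches the maximum $\log_2 4 = 2$ bits, the certain case carries no information, and the skewed case falls in between.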

The special case of information entropy for a random variable with two outcomes is the binary entropy function, usually taken to the logarithmic base 2, thus having the shannon (Sh) as unit:

$$H_{\mathrm {b} }(p)=-p\log _{2}p-(1-p)\log _{2}(1-p)$$
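
A direct transcription of the binary entropy function might look as follows. The function name `binary_entropy` is an illustrative choice, and the endpoints are handled explicitly using the convention $0 \log 0 = 0$.

```python
import math

def binary_entropy(p):
    """Binary entropy H_b(p) in shannons for a two-outcome variable."""
    if p in (0.0, 1.0):
        # 0 * log2(0) is taken to be 0, so a certain outcome has zero entropy.
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

for p in (0.0, 0.1, 0.5, 0.9, 1.0):
    print(p, binary_entropy(p))
```

The function is symmetric around $p = 0.5$, where it peaks at 1 shannon.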

### Joint Entropy
For random variables $X_{1},...,X_{n}$, the joint entropy of these random variables is the entropy of their joint distribution:
$$\mathrm {H} (X_{1},...,X_{n})=-\sum _{x_{1}\in {\mathcal {X}}_{1}}...\sum _{x_{n}\in {\mathcal {X}}_{n}}P(x_{1},...,x_{n})\log _{2}[P(x_{1},...,x_{n})]$$
Here, $P(x_{1},...,x_{n})$ is the probability of these values occurring together, and $P(x_{1},...,x_{n})\log _{2}[P(x_{1},...,x_{n})]$ is defined to be 0 if $P(x_{1},...,x_{n})=0$.
If the random variables are independent of each other, the joint probability factorizes into the product of the marginal probabilities, so their joint entropy is the sum of their individual entropies.

Note that
1. The joint entropy is greater than or equal to each of the individual entropies, i.e.
$$\mathrm {H} {\bigl (}X_{1},\ldots ,X_{n}{\bigr )}\geq \max _{1\leq i\leq n}{\Bigl \{}\mathrm {H} {\bigl (}X_{i}{\bigr )}{\Bigr \}}$$
2. The joint entropy is less than or equal to the sum of the individual entropies, i.e.
$$\mathrm {H} (X,Y)\leq \mathrm {H} (X)+\mathrm {H} (Y)$$
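
The two bounds can be checked numerically on a small example. The sketch below assumes a joint distribution stored as a dictionary `{(x, y): probability}`; the table values and the helper names `entropy` and `marginal` are hypothetical.

```python
import math
from collections import defaultdict

def entropy(probs):
    """Entropy in bits of an iterable of probabilities (0 * log 0 := 0)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def marginal(joint, axis):
    """Marginal distribution of one coordinate of a joint table."""
    out = defaultdict(float)
    for outcome, p in joint.items():
        out[outcome[axis]] += p
    return dict(out)

# Two correlated binary variables (illustrative numbers).
joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
h_xy = entropy(joint.values())
h_x = entropy(marginal(joint, 0).values())
h_y = entropy(marginal(joint, 1).values())
print(max(h_x, h_y), "<=", h_xy, "<=", h_x + h_y)

# For independent variables the upper bound is attained.
independent = {(x, y): 0.5 * 0.5 for x in (0, 1) for y in (0, 1)}
print(entropy(independent.values()))
```

In the correlated case the joint entropy lies strictly between the two bounds, while the independent table gives exactly $\mathrm{H}(X) + \mathrm{H}(Y)$.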

### Conditional Entropy
The conditional entropy or conditional uncertainty of X given random variable Y (also called the equivocation of X about Y) is the average conditional entropy over Y:
$$H(X|Y)=\mathbb {E} _{Y}[H(X|y)]=-\sum _{y\in Y}p(y)\sum _{x\in X}p(x|y)\log p(x|y)=-\sum _{x,y}p(x,y)\log p(x|y)$$

Because entropy can be conditioned on a random variable or on that random variable being a certain value, care should be taken not to confuse these two definitions of conditional entropy, the former of which is in more common use. A basic property of this form of conditional entropy is that:
$$H(X|Y)=H(X,Y)-H(Y)$$
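Applying this property repeatedly gives the chain rule for joint entropy, where the $k=1$ term is simply $\mathrm{H}(X_{1})$: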
$$\mathrm {H} (X_{1},\dots ,X_{n})=\sum _{k=1}^{n}\mathrm {H} (X_{k}|X_{k-1},\dots ,X_{1})$$
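
As a rough numerical check of $H(X|Y)=H(X,Y)-H(Y)$, the following sketch computes the conditional entropy directly from the definition $-\sum_{x,y}p(x,y)\log p(x|y)$ and compares it with the difference of entropies. The joint table and function names are assumptions made for the example, and base 2 is used.

```python
import math
from collections import defaultdict

def entropy(probs):
    """Entropy in bits of an iterable of probabilities (0 * log 0 := 0)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def conditional_entropy(joint):
    """H(X|Y) in bits from a joint table {(x, y): p}."""
    p_y = defaultdict(float)
    for (_, y), p in joint.items():
        p_y[y] += p
    # p(x|y) = p(x, y) / p(y); zero-probability cells contribute nothing.
    return -sum(p * math.log2(p / p_y[y])
                for (_, y), p in joint.items() if p > 0)

joint = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
p_y = {0: 0.5, 1: 0.5}  # marginal of Y for the table above

h_x_given_y = conditional_entropy(joint)
h_xy = entropy(joint.values())
h_y = entropy(p_y.values())
print(h_x_given_y, h_xy - h_y)
```

Both printed values agree up to floating-point error, which is the two-variable case of the chain rule.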

### Mutual information
Mutual information (MI) quantifies the "amount of information" obtained about one random variable by observing the other random variable. It measures how different the joint distribution of the pair $(X,Y)$ is from the product of the marginal distributions of $X$ and $Y$, and it equals the expected value of the __pointwise mutual information__ (PMI). It is also known as __information gain__. The definition below uses the __Kullback–Leibler divergence__, a measure of how one probability distribution differs from a second, reference probability distribution.

Specifically, let $(X,Y)$ be a pair of random variables with values over the space ${\mathcal {X}}\times {\mathcal {Y}}$. If their joint distribution is $P_{(X,Y)}$ and the marginal distributions are $P_{X}$ and $P_{Y}$, the mutual information is defined as
$$I(X;Y)=D_{\mathrm {KL} }(P_{(X,Y)}\|P_{X}\otimes P_{Y})$$

By a property of the Kullback–Leibler divergence, $I(X;Y)$ is equal to zero precisely when the joint distribution equals the product of the marginals, that is, when $X$ and $Y$ are independent. In general $I(X;Y)$ is __non-negative__ and __symmetric__. It is a measure of the price for encoding $(X,Y)$ as a pair of independent random variables when in reality they are not. Mutual information can equivalently be expressed in terms of entropies:

$$\begin{align*}
I (X;Y) & \equiv \mathrm {H}(X)-\mathrm {H} (X|Y)\\
& \equiv \mathrm {H} (Y)-\mathrm {H} (Y|X) \\
& \equiv \mathrm {H} (X)+\mathrm {H} (Y)-\mathrm {H} (X,Y)\\
&\equiv \mathrm {H} (X,Y)-\mathrm {H} (X|Y)-\mathrm {H} (Y|X) \\
& \equiv \mathbb {E} _{p(y)}[D_{\mathrm {KL} }(p(X|Y=y)\|p(X))].
\end{align*}$$
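
The equivalence between the Kullback–Leibler form and the entropy form can be sketched numerically. The snippet below assumes a small, made-up joint table `{(x, y): p}` and computes $I(X;Y)$ both as the divergence between the joint and the product of marginals and as $\mathrm{H}(X)+\mathrm{H}(Y)-\mathrm{H}(X,Y)$; all names are illustrative.

```python
import math
from collections import defaultdict

def entropy(probs):
    """Entropy in bits of an iterable of probabilities (0 * log 0 := 0)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

def marginals(joint):
    """Marginal distributions of X and Y from a joint table {(x, y): p}."""
    p_x, p_y = defaultdict(float), defaultdict(float)
    for (x, y), p in joint.items():
        p_x[x] += p
        p_y[y] += p
    return p_x, p_y

def mutual_information(joint):
    """I(X;Y) in bits as D_KL(joint || product of marginals)."""
    p_x, p_y = marginals(joint)
    return sum(p * math.log2(p / (p_x[x] * p_y[y]))
               for (x, y), p in joint.items() if p > 0)

joint = {(0, 0): 0.5, (0, 1): 0.25, (1, 0): 0.0, (1, 1): 0.25}
p_x, p_y = marginals(joint)
mi = mutual_information(joint)
mi_from_entropies = (entropy(p_x.values()) + entropy(p_y.values())
                     - entropy(joint.values()))
print(mi, mi_from_entropies)

# Swapping the roles of X and Y leaves the value unchanged (symmetry).
swapped = {(y, x): p for (x, y), p in joint.items()}
print(mutual_information(swapped))
```

For a table that factorizes into its marginals, every log term is $\log_2 1 = 0$, so the same function returns zero.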

### Cross Entropy
Cross entropy can be interpreted as the expected message length per datum when a wrong distribution $q$ is assumed while the data actually follow a distribution $p$. That is why the expectation is taken over the true probability distribution $p$ and not $q$. Writing $l = -\log_{2} q(x)$ for the length in bits of the message assigned to $x$ under the assumed distribution $q$, the expected message length under the true distribution $p$ is

$$\operatorname {E} _{p}[l]=-\operatorname {E} _{p}\left[{\frac {\ln {q(x)}}{\ln(2)}}\right]=-\operatorname {E} _{p}\left[\log _{2}{q(x)}\right]=-\sum _{x_{i}}p(x_{i})\,\log _{2}{q(x_{i})}=-\sum _{x}p(x)\,\log _{2}q(x)=H(p,q)$$
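
A small numerical sketch of this expectation follows. The distributions `p` and `q` are made-up examples, the function name `cross_entropy` is illustrative, and the snippet assumes $q(x) > 0$ wherever $p(x) > 0$ (otherwise `math.log2` raises a `ValueError`).

```python
import math

def cross_entropy(p, q):
    """H(p, q) in bits: expected code length when coding for q but drawing from p.

    Assumes q[i] > 0 wherever p[i] > 0.
    """
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.5, 0.25, 0.25]   # true distribution
q = [0.25, 0.25, 0.5]   # assumed (wrong) distribution

# Code lengths implied by q are -log2 q(x) = 2, 2 and 1 bits.
h_pq = cross_entropy(p, q)   # 0.5*2 + 0.25*2 + 0.25*1
h_pp = cross_entropy(p, p)   # equals the entropy of p
print(h_pq, h_pp)
```

Here the wrong model costs 1.75 bits per symbol on average against 1.5 bits with the true distribution, consistent with the general fact that $H(p,q) \geq H(p)$ with equality when $q = p$.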