# Shannon's entropy
In information theory, the entropy of a random variable is the average level of "information" inherent to the variable's possible outcomes. 

The American mathematician Claude Elwood Shannon was interested in how to mathematically model the process of information transmission.
 
One of the main problems that Shannon encountered was how to measure the information content of a message. He published his findings in a 1948 article, "A Mathematical Theory of Communication", which has become one of the most cited scientific works in history and founded the field of information theory.  

Shannon realized that the information content of a message or of any random variable is best thought of as **how
surprising its content is, how much uncertainty
it resolves.** As the basic measure of
information content, he introduced the unit bit.
 
If you want to capture the information content of a random variable with N different outcomes that are all equally likely, how many bits does it take? Shannon showed that the answer is given by the **logarithm of N with base 2**. At the suggestion of John von Neumann, he called his measure entropy, following the notion of entropy that Boltzmann had introduced in thermodynamics and physics. Entropy is denoted with the uppercase Greek letter Eta, which is written like the English letter H.  
  
**H = log2(N)**   
  
Let's see some examples.

In [4]:
from math import log

In [2]:
log(1,2) #log base 2 of 1

0.0

If a random variable can only take on one possible value, with probability one, then you will not learn anything new by observing it: you already know the outcome. So the information content of that random variable is zero.

## Entropy formula
We can encapsulate the formula into an easier Python function which requires only the number of possible outcomes:

In [2]:
def entropy(n, logbase=2):
    # Calculate the Shannon Entropy given random variable with equally likely outcomes
    # n = number of possible outcomes of the random variable; n > 0
    # logbase = base of the logarithm; default is 2 (for digital bits)
    # returns the Shannon entropy H
    
    return log(n,logbase)

In [27]:
entropy(1) # same as log(1,2)

0.0

In [74]:
entropy(2)

1.0

The information content of a random variable
that can take on two outcomes with
equal probability (this could be e.g. the outcome of a fair coin toss) can be captured in one bit.  
So you can understand the basic unit of Shannon's entropy as the information contained in the outcome of a coin toss.    
  
And so on:

In [28]:
entropy(4) # entropy of a variable with four possible outcomes is two

2.0

Note that if H = log2(N) then N = 2^H.  
That means 1 bit can hold all the information needed for the outcome of a coin toss.  
  
This can be used to calculate how many bits are necessary to hold the information of an event.  
For example, 2 bits can hold all the information needed for an event with 4 possible outcomes.  
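
The relation N = 2^H can be turned around to find the smallest whole number of bits able to label N distinct outcomes. The sketch below is one possible way to do it. `bits_needed` is a hypothetical helper name that is not used elsewhere in this notebook, and it relies on integer arithmetic (`int.bit_length`) rather than rounding up `log(n, 2)`, to avoid floating-point edge cases.

```python
def bits_needed(n_outcomes):
    """Smallest whole number of bits that can label n_outcomes distinct outcomes.

    Equivalent to rounding log2(n_outcomes) up to the next integer.
    """
    if n_outcomes < 1:
        raise ValueError("n_outcomes must be at least 1")
    # (n - 1).bit_length() is the number of binary digits needed to write n - 1,
    # the largest label when the outcomes are numbered 0 .. n - 1.
    return (n_outcomes - 1).bit_length()


for n in (1, 2, 4, 10):
    print(n, "outcomes ->", bits_needed(n), "bit(s)")
```

With this helper, 1 outcome needs 0 bits, 2 outcomes need 1 bit, 4 outcomes need 2 bits and 10 outcomes need 4 bits.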
## Fractional bits

In [29]:
entropy(10) # entropy of a variable with 10 possible outcomes

3.3219280948873626

The information content captured by one digit in our decimal system (any digit from 0 to 9) is the base 2 logarithm of 10, which is about 3.32 bits.  
How should we think of fractional bits?  
Well, one way of thinking about it is that if I have a long sequence of decimal digits, I can piece together fractional bits to express the information content of the entire sequence.  
  
So for example, if I have
three decimal digits, with them I can capture any
number from 0 to 999.  
And the information content is the
base 2 logarithm of 1000 or 3 times the base 2 logarithm
of 10, which is 9.97.   
  
So I can encode a three-digit decimal number with 10 bits.  
In fact, 10 bits could capture up to 2^10, which is 1,024 different numbers (sounds familiar?). So there are about 0.03 bits of redundancy (10 minus 9.97).

In [30]:
entropy(1000)

9.965784284662087

In [31]:
entropy(1024)

10.0
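
The idea of piecing fractional bits together can be checked numerically: a single decimal digit stored on its own needs 4 whole bits, but longer groups of digits packed together need fewer bits per digit, getting close to log2(10). This is only an illustrative sketch; `bits_per_digit` is a hypothetical helper name.

```python
from math import log2


def bits_per_digit(num_digits):
    """Whole bits needed to store a num_digits-long decimal number, per digit."""
    n_values = 10 ** num_digits                # e.g. 3 digits -> 1000 possible values
    whole_bits = (n_values - 1).bit_length()   # smallest whole number of bits that fits them
    return whole_bits / num_digits


print("entropy of one digit:", log2(10))
for k in (1, 3, 6, 9):
    print(k, "digits packed together ->", bits_per_digit(k), "bits per digit")
```

The packed cost never drops below the entropy of one digit, which is the sense in which 3.32 bits is the true information content.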

## Entropy formula given (equal) probability 
For a random variable *x* with equi-probable outcomes, the formula can also be written as:  
**H = log N = -log P** where P is the probability P = 1/N (for equi-probable outcomes)  
For example, instead of log2(10) i.e., using 10 possible outcomes, we could use the probability, which would be 1/10 and get the same entropy value:

In [75]:
log(10,2)

3.3219280948873626

In [33]:
-log(1/10,2)

3.321928094887362

Therefore, we can rewrite the function accepting the probability as input:

In [11]:
def entropyP(P, logbase=2):
    # Calculate the Shannon entropy of a random variable with equally likely outcomes
    # P is the probability of each outcome; 0 < P <= 1
    return -log(P, logbase)

In [35]:
entropyP(0.1)

3.321928094887362

In [36]:
entropyP(0.5)  #same as entropy (2)

1.0

In [37]:
entropy(2)

1.0

### Extra 1: other logbase
Usually log2 is used for entropy, which means the unit of information content is the bit; this has advantages when talking about information handled by computers.  
However, we could use other logarithm bases; if we use base 10 we get a measure that tells us how many decimal digits we need to capture the same information.  
  
  As an example:  
we have seen that we need at least 4 bits (entropy = 3.32) to represent a decimal digit from 0 to 9, but we need only 1 decimal digit to represent it.  
In fact log10(10) = 1 and the entropy (using base 10) of 10 is 1:

In [5]:
entropy(10) # standard definition: use base 2 and bits

3.3219280948873626

In [6]:
entropy(10,10) # entropy using decimal digits instead of bits: only 1 decimal digit needed

1.0
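
Changing the logarithm base only rescales the entropy, so a value in bits can be converted to another unit by dividing by the base 2 logarithm of the new base. The sketch below illustrates this with the standard change-of-base rule; `convert_from_bits` is a hypothetical helper name.

```python
from math import log, log2


def convert_from_bits(h_bits, logbase):
    """Convert an entropy given in bits to the unit of another logarithm base."""
    return h_bits / log2(logbase)


h_bits = log2(1000)                    # three decimal digits, measured in bits
print(convert_from_bits(h_bits, 10))   # the same information in decimal digits
print(log(1000, 10))                   # direct computation with base 10, for comparison
```

Both printed values should agree up to floating-point error.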

### Extra 2: Entropy for unequal probabilities

Let's generalize this a bit, and let's assume that we have a random variable *x* with
N different outcomes that have different
probabilities Pi, that are all non-negative and add up to one.  
Then we can denote the Shannon information
content, or the surprisal, of a given outcome *i* as
minus the logarithm of Pi.  
  
If the probability of
outcome *i* is close to one, then we are not very
surprised by that outcome and the information content, or surprisal, is close to zero. On the other hand, if the probability of
an outcome is very low, then we will be very surprised, so the information content of that outcome is a large number.  
This should make
the definition of information content intuitive.  
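
The surprisal described above can be sketched in a few lines. `surprisal` is a hypothetical helper name, and base 2 is assumed so that the result is in bits.

```python
from math import log2


def surprisal(p):
    """Information content, in bits, of an outcome that has probability p."""
    if not 0 < p <= 1:
        raise ValueError("p must be in the interval (0, 1]")
    return -log2(p)


for p in (0.99, 0.5, 0.01):
    print("probability", p, "->", surprisal(p), "bits")
```

An almost certain outcome yields a value close to zero, while a rare one yields several bits.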
  
  Let's see an example: how different is the entropy of a variable with five possible outcomes that do not all have the same probability?

In [54]:
entropy(5) # this is the entropy if all outcomes have the same probability

2.321928094887362

Now we make a list of five probabilities, one for each outcome *i*, that sum to one:

In [8]:
exampleDiffProb = [0.1, 0.2, 0.4, 0.2, 0.1]

In [9]:
sum(exampleDiffProb)  # expected to be 1.0, give or take a floating-point error.

1.0000000000000002

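The entropy is the average surprisal, with each outcome weighted by its probability: H = -Sum(Pi * log Pi). We can calculate the single terms *Hi* = -Pi * log Pi: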

In [12]:
ents = [p * entropyP(p) for p in exampleDiffProb]
ents

[0.33219280948873625,
 0.46438561897747244,
 0.5287712379549449,
 0.46438561897747244,
 0.33219280948873625]

In [13]:
print("Entropy for not equal probabilities = ", sum(ents))

Entropy for not equal probabilities =  2.121928094887362


This tells us the entropy is ~2.12 bits.  
  
Note that it's lower than the 2.32 bits of the equally likely case. Now let's see a more interesting example: the entropy of a die roll:

In [55]:
entropy(6) # die has 6 possible outcomes

2.584962500721156

We can assume that a fair die has all equal probabilities for the outcomes:

In [15]:
fairDie = [1/6 for x in range(6)]
fairDie

[0.16666666666666666,
 0.16666666666666666,
 0.16666666666666666,
 0.16666666666666666,
 0.16666666666666666,
 0.16666666666666666]

What happens if it's an unfair die, so that the probability of getting a 6 is higher than the others (and the probability of getting a 1 is lower)?  
Let's modify the list of probabilities and calculate its entropy:

In [16]:
unfairDie = fairDie.copy()  # copy the list, so that fairDie itself is not modified
unfairDie[0] = fairDie[0] - 0.1
unfairDie[5] = fairDie[5] + 0.1
unfairDie

[0.06666666666666665,
 0.16666666666666666,
 0.16666666666666666,
 0.16666666666666666,
 0.16666666666666666,
 0.26666666666666666]

In [17]:
sum(unfairDie)

1.0

In [18]:
ents = [p * entropyP(p) for p in unfairDie]
print("Entropy for not equal probabilities = ", sum(ents))

Entropy for not equal probabilities =  2.49227186568361


The entropy for the unfair die is lower!  
Here is a general rule.  
The entropy is maximized when all outcomes are equally likely: **if all P(xi) values are equal, the entropy is at its maximum, indicating maximum uncertainty.**
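
The calculations in this section can be collected into one function that accepts any list of probabilities. This is a minimal sketch: `shannon_entropy` is a hypothetical name, and skipping outcomes with probability zero is an assumption based on the usual convention that such outcomes contribute nothing.

```python
from math import log


def shannon_entropy(probs, logbase=2):
    """Entropy of a discrete distribution given as a list of probabilities."""
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    # Each outcome contributes p * (-log p); outcomes with p = 0 are skipped.
    return sum(-p * log(p, logbase) for p in probs if p > 0)


uniform = [1 / 5] * 5                   # all outcomes equally likely
skewed = [0.1, 0.2, 0.4, 0.2, 0.1]      # same list as exampleDiffProb above
certain = [1.0, 0.0, 0.0, 0.0, 0.0]     # one outcome is sure to happen

for name, dist in (("uniform", uniform), ("skewed", skewed), ("certain", certain)):
    print(name, shannon_entropy(dist))
```

The uniform list gives the largest value, the skewed one a smaller value, and the certain one gives zero.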

## Information as decrease in uncertainty
Shannon observed that information = a decrease in uncertainty.  
Given our measure of entropy, the information obtained from a signal or message is captured by:  
R = H_before - H_after (R is a difference of entropies; the letter follows Shannon's notation for the rate of transmission)  
  
Example: if you roll a die and the result is "the number is even" then the information obtained is:    
R = H(6) - H(3)

In [63]:
# entropy before, as the outcome can be any number between 1 and 6
entropy(6)

2.584962500721156

In [64]:
#entropy after, now we know it's even: can only be 2, 4 or 6 (one of 3 numbers)
entropy(3)

1.5849625007211563

In [65]:
entropy(6)-entropy(3)

0.9999999999999998

The information obtained is 1 bit, which is exactly the information of knowing if a die roll is odd or even:

In [66]:
# the entropy = information of knowing if it's odd or even
entropy(2)

1.0
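
For equally likely outcomes, the difference H_before - H_after depends only on how many outcomes remain possible after the message. The sketch below wraps this in a hypothetical helper, `information_gained`, and tries it on a few messages about a die roll; the "greater than 4" message is an extra illustration that does not appear in the text above.

```python
from math import log2


def information_gained(n_before, n_after):
    """Bits learned when n_before equally likely outcomes are narrowed down to n_after."""
    return log2(n_before) - log2(n_after)


print(information_gained(6, 3))  # told "the number is even": 3 outcomes remain
print(information_gained(6, 2))  # told "the number is greater than 4": 2 outcomes remain
print(information_gained(6, 1))  # told the exact number: 1 outcome remains
```

The fewer outcomes a message leaves open, the more bits it carries.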

## Joint entropy
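Consider two random variables x and y; their joint entropy is  
H(x,y) = −Sum(Pij * log Pij)  
where Pij is the probability that x takes its outcome i and y takes its outcome j, and the sum runs over all pairs (i, j).  
  
If x and y are independent, then Pij = Pi * Pj and so:  
H(x,y) = H(x) + H(y)  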
  
Here is an example.  
Consider a random variable x with an entropy of 5 bits.  
Now let's consider another random variable y = ± square root(x), which is either plus or minus the square root of x with probability 50/50, independently of x. What is the entropy of y?  
Knowing y is the same as knowing both x and the sign, so this is a case of joint entropy of two independent variables and it's the sum:  

In [67]:
5 + entropyP(0.5) 

6.0

The additional uncertainty about the sign adds one bit of entropy.  
Note that the particular function used (square root or any other one-to-one function of x) does not matter. For the entropy, only the outcome probabilities matter.
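
The additivity of entropy for independent variables can be verified numerically by building the joint probabilities as products. The sketch below uses a fair coin and a fair die as assumed example variables; `entropy_of` is a hypothetical helper name.

```python
from math import log2


def entropy_of(probs):
    """Entropy in bits of a list of probabilities that sum to 1."""
    return sum(-p * log2(p) for p in probs if p > 0)


coin = [0.5, 0.5]
die = [1 / 6] * 6
# For independent variables the joint probability is the product Pij = Pi * Pj.
joint = [p_coin * p_die for p_coin in coin for p_die in die]

print(entropy_of(joint))                   # H(x,y) computed from the 12 joint outcomes
print(entropy_of(coin) + entropy_of(die))  # H(x) + H(y)
```

The two printed values should match up to floating-point error.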

Another example. Imagine you play a coin toss game in which you count how long it takes until you hit the first tail. What is the surprisal of the outcome "hitting a tail on the third toss"?  
The tosses are all independent. From classical probability theory, we know the probability of two heads followed by a tail is 1/2 * 1/2 * 1/2 = 1/8  
So its surprisal is:

In [68]:
entropyP(1/8)

3.0

In [81]:
#another way to get this is to add up the surprisals of the three independent coin tosses: 
entropyP(0.5) + entropyP(0.5) + entropyP(0.5)

3.0
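
The same reasoning works for any toss number: the first tail on toss k requires k - 1 heads followed by one tail. The sketch below generalizes the calculation; `first_tail_surprisal` and its `p_tail` parameter are hypothetical names, with a fair coin as the default.

```python
from math import log2


def first_tail_surprisal(k, p_tail=0.5):
    """Surprisal, in bits, of seeing the first tail exactly on toss k."""
    p_head = 1 - p_tail
    probability = p_head ** (k - 1) * p_tail   # k - 1 heads, then one tail
    return -log2(probability)


for k in (1, 2, 3, 10):
    print("first tail on toss", k, "->", first_tail_surprisal(k), "bits")
```

For a fair coin each additional toss adds exactly one bit of surprisal.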

## Conditional entropy
Consider two random variables x and y; the conditional entropy of x given y is:    
H(x|y) = −Sum(Pij * log P(i|j))  
where P(i|j) = Pij / Pj is the probability of outcome i of x given outcome j of y.  
  
This is equal to the average uncertainty about x if y is known.    
We can also express it as the difference between the joint entropy and the entropy of y:  
H(x|y) = H(x,y) − H(y)  
  
The intuition is that if you already know y then the additional information captured by the joint distribution of the two variables is simply the information in x conditional on knowing y.  
  
An example.  
Let x be the outcome of rolling a die and y a binary variable that indicates whether the outcome is odd or even.  

What is the conditional entropy of x (die roll) given y (odd or even)?  
H(x|y) = H(x,y) - H(y)  
H(x,y) = H(x) because in this case x (die roll) contains all the information in y (odd or even)   
so the conditional entropy is  
H(x|y) = H(x) - H(y)

In [70]:
entropy(6) - entropy(2)

1.584962500721156

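What is the conditional entropy of y given x (for the die roll)?  
The answer is zero: once the number rolled is known, whether it is odd or even is known too, so H(y|x) = H(x,y) - H(x) = H(x) - H(x) = 0.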
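
Both conditional entropies of the die example can also be computed directly from the joint distribution, using the conditional probability Pij / Pj inside the logarithm. This is a sketch under the assumption of a fair die; `marginal` and `conditional_entropy` are hypothetical helper names.

```python
from math import log2

# Joint distribution of x (die face) and y (parity): each face has probability 1/6
# and fully determines its parity.
joint = {(face, "even" if face % 2 == 0 else "odd"): 1 / 6 for face in range(1, 7)}


def marginal(joint, index):
    """Sum the joint probabilities over the other variable."""
    totals = {}
    for pair, p in joint.items():
        totals[pair[index]] = totals.get(pair[index], 0.0) + p
    return totals


def conditional_entropy(joint, given_index):
    """Entropy of the other variable given the one at given_index (0 = x, 1 = y)."""
    p_given = marginal(joint, given_index)
    # Each pair contributes Pij * (-log2 of the conditional probability Pij / Pj).
    return sum(-p * log2(p / p_given[pair[given_index]])
               for pair, p in joint.items() if p > 0)


print(conditional_entropy(joint, given_index=1))  # H(x|y): die roll given parity
print(conditional_entropy(joint, given_index=0))  # H(y|x): parity given die roll
```

The first value should be close to log2(3), about 1.58 bits, and the second should be zero.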

## Mutual information
The mutual information I(x,y) captures the amount of information that two random variables have in common, meaning how much information we can learn about one variable by observing another. It's a symmetric measure, so I(x,y) equals I(y,x).  
If the two random variables are independent, then, of course, the mutual information is zero.  
If the two variables are precisely the same, then the mutual information equals the entropy of the variable itself, meaning that all the information content is mutual.  
  
We can use the conditional entropy to write the mutual information of x and y as the information content of x minus the information content of x given that we already know y:  
I(x,y) = H(x) - H(x|y)  
In other words, knowing y allows us to save I(x,y) bits in encoding x.  
Being a mutual and symmetric measure, the other way is also valid:  
I(x,y) = H(y) - H(y|x)  
  
  Let's continue our example of the die: x is the roll of the die and y is whether it is odd or even.  
We have just seen that the conditional entropy H(x|y) = 1.58  
Then I(x,y) = H(x) - H(x|y) = H(x) - 1.58  

In [20]:
entropy(6) # this is H(x)

2.584962500721156

In [22]:
2.58 - 1.58

1.0

In this particular example, all of
the content of y is mutual to x and y because when we know x - 
what number has rolled - we would immediately know
y - whether it's even or odd.  
Therefore, the mutual information
content equals the entropy, or information content, of y - whether it's
even or odd - which is precisely one bit. 



In [23]:
entropy(2)

1.0
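
Combining I(x,y) = H(x) - H(x|y) with H(x|y) = H(x,y) - H(y) gives I(x,y) = H(x) + H(y) - H(x,y), which is easy to compute from a table of joint probabilities. The sketch below applies it to the die and its parity, and to a die paired with an independent fair coin (the coin is an added illustration). `entropy_of` and `mutual_information` are hypothetical helper names.

```python
from math import log2


def entropy_of(probs):
    """Entropy in bits of a list of probabilities that sum to 1."""
    return sum(-p * log2(p) for p in probs if p > 0)


def mutual_information(joint):
    """I(x,y) = H(x) + H(y) - H(x,y) for a table with joint[i][j] = Pij."""
    p_x = [sum(row) for row in joint]            # marginal of x: sum over each row
    p_y = [sum(col) for col in zip(*joint)]      # marginal of y: sum over each column
    p_xy = [p for row in joint for p in row]     # all joint probabilities, flattened
    return entropy_of(p_x) + entropy_of(p_y) - entropy_of(p_xy)


# Rows: die faces 1..6; columns: (odd, even).
die_and_parity = [[1 / 6, 0] if face % 2 else [0, 1 / 6] for face in range(1, 7)]
# Rows: die faces 1..6; columns: the two sides of an independent fair coin.
die_and_coin = [[1 / 12, 1 / 12] for _ in range(6)]

print(mutual_information(die_and_parity))  # parity is fully determined by the die
print(mutual_information(die_and_coin))    # independent variables
```

The first value should be close to one bit and the second close to zero, up to floating-point error.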