## Entropy

Chapter 3 of [Deep Learning](https://www.deeplearningbook.org/).

## Thermodynamic entropy

1st Law = conservation of energy = you can only break even
- energy can never be created or destroyed

2nd Law = the entropy of an isolated system never decreases = you will always lose
- processes are irreversible
- puts limits on the efficiency of heat engines

## Claude Shannon

1916 - 2001.  American electrical engineer. [Wikipedia](https://de.wikipedia.org/wiki/Claude_Shannon).

![](assets/shannon.jpg)

Shannon founded **Information Theory** in 1948.  Shannon's paper [A Mathematical Theory of Communication](http://math.harvard.edu/~ctm/home/text/others/shannon/entropy/entropy.pdf) starts out with a discussion of the logarithm, and so will I.

## Use of the logarithm

The logarithm transforms exponential into linear relationships

Practical & intuitive

> Parameters of engineering importance such as time, bandwidth, number
of relays, etc., tend to vary linearly with the logarithm of the number of possibilities. For example,
adding one relay to a group doubles the number of possible states of the relays.

Mathematically suitable

> Many of the limiting operations are simple in terms of the logarithm but would require clumsy restatement in terms of the number of possibilities.

Choice of the logarithm base = determines the unit of information
- $\log_{e}$ = nats
- $\log_{2}$ = **bits**
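
A minimal sketch of the two points above, using only the standard library. The relay counts and the choice of 8 states are arbitrary illustration values, not taken from Shannon's paper.

```python
import math

# Each added relay doubles the number of states,
# so log2(number of states) grows linearly with the number of relays.
for n_relays in range(1, 5):
    n_states = 2 ** n_relays
    print(n_relays, n_states, math.log2(n_states))

# The same quantity measured in different units: 1 bit = ln(2) nats.
n_states = 8
bits = math.log2(n_states)
nats = math.log(n_states)
print(bits, nats, nats / math.log(2))
```

Changing the base only rescales the result by a constant, which is why the base is a choice of unit rather than a change in the quantity being measured.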

## Bits of information

Bit = 0 or 1
- but not all bits are useful
- one useful bit halves the uncertainty, i.e. reduces it by a factor of 2 (encoding independent)

Byte = eight bits
- this is the byte in megabytes
- can encode an integer from 0 to 255
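
As a rough illustration of the bullets above, the sketch below counts how many whole bits are needed to label a set of equally likely outcomes. The helper name `bits_needed` is made up for this example.

```python
import math

def bits_needed(n_outcomes):
    """Smallest whole number of bits that can label n_outcomes distinct values."""
    return math.ceil(math.log2(n_outcomes))

# One byte = 8 bits = 2 ** 8 distinct values, i.e. the integers 0 to 255.
largest_byte_value = 2 ** 8 - 1
print(largest_byte_value, bits_needed(256))

# Each useful bit halves the remaining possibilities: 8 -> 4 -> 2 -> 1.
remaining = 8
while remaining > 1:
    remaining //= 2
    print(remaining)
```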

## Information Theory

Originally developed in the context of sending communication via radio

Intuition = **learning an unlikely event has happened is more useful than learning a likely event has happened**

We want to quantify this intuition
- guaranteed event = zero information
- likely events = low information
- less likely events = high information
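
One function with these three properties is the negative log of the probability. The sketch below assumes base 2, so the result is in bits; the name `self_information` is chosen for this example.

```python
import math

def self_information(prob, base=2):
    """Information gained from observing an event that has probability prob."""
    # log(1 / prob) is the same as -log(prob)
    return math.log(1 / prob, base)

# A guaranteed event carries no information; rarer events carry more.
for prob in [1.0, 0.9, 0.5, 0.01]:
    print(prob, self_information(prob))
```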

## Shannon Entropy

Measurement of 
- randomness 
- unpredictability 
- uncertainty

$$H(P) = -\mathbf{E}_{x \sim P}[\log P(x)] $$

Measurement of information
- how much information (in expectation) you get when sampling from a probability distribution

Low probability samples carry more information
- biased coin is low entropy, unbiased coin is high entropy
- maximized for uniform distributions

In [None]:
import math

import matplotlib.pyplot as plt
import numpy as np
from scipy.stats import entropy as entropyr

from common import make_pmf

%matplotlib inline

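def entropy(probs, base=2):
    #  probabilities must sum to one - compare with a tolerance, not exact float equality
    assert math.isclose(sum(probs), 1.0)
    #  skip zero probabilities - their contribution to the entropy is defined as zero
    return sum(-prob * math.log(prob, base) for prob in probs if prob > 0)

#  check against the scipy implementation
np.testing.assert_allclose(entropyr([0.1, 0.9], base=2), entropy([0.1, 0.9]))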

In [None]:
entropy([0.5, 0.5])

In [None]:
entropy([0.1, 0.9])
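
To see the coin claim across more than two cases, here is a small self-contained sketch that sweeps the probability of heads. The function name `binary_entropy` and the list of probabilities are chosen for this example.

```python
import math

def binary_entropy(p):
    """Entropy in bits of a coin that lands heads with probability p."""
    # zero probability outcomes contribute nothing, so they are skipped
    return sum(-q * math.log2(q) for q in (p, 1 - p) if q > 0)

heads_probs = [0.0, 0.1, 0.25, 0.5, 0.75, 0.9, 1.0]
for p in heads_probs:
    print(p, round(binary_entropy(p), 3))

# the fair coin is the most uncertain one
most_uncertain = max(heads_probs, key=binary_entropy)
print(most_uncertain)
```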

## Kullback-Leibler divergence (KLD)

Suppose we have two distributions $P(x)$ and $Q(x)$
- example = a parameterized neural net & the function we are trying to learn

We can measure the difference between the two using the **Kullback-Leibler divergence (KLD)** (it is not a true distance measure because it is not symmetric)

$$D_{KL}(P||Q) = \mathbf{E}_{x \sim P}[\log P(x) - \log Q(x)] $$

The extra amount of information needed to send a message containing symbols from $P$ when using a code designed to minimize the length of messages for $Q$
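
A minimal sketch of the definition above for discrete distributions given as lists, in bits. The function name `kl_divergence` and the two example distributions are assumptions made for illustration; the function assumes $Q(x) > 0$ wherever $P(x) > 0$.

```python
import math

def kl_divergence(p, q):
    """D_KL(P||Q) in bits; assumes q > 0 wherever p > 0."""
    return sum(
        pi * (math.log2(pi) - math.log2(qi))
        for pi, qi in zip(p, q)
        if pi > 0
    )

p = [0.5, 0.5]
q = [0.1, 0.9]

print(kl_divergence(p, q))  # P measured against Q
print(kl_divergence(q, p))  # Q measured against P - a different number
print(kl_divergence(p, p))  # identical distributions
```

Swapping the arguments changes the result, which is the asymmetry mentioned above.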

## Cross entropy

$$ H(P,Q) = - \mathbf{E}_{x \sim P}\log Q(x)$$

Average number of bits needed to 
- identify a sample 
- with a coding scheme optimized for an estimated distribution $Q$ 
- rather than the true distribution $P$

Minimizing the KLD with respect to $Q$ is the same as minimizing the cross entropy, because $H(P)$ does not depend on $Q$

Cross entropy = entropy + KLD

$$H(P,Q) = H(P) + D_{KL}(P||Q)$$

If true = predicted -> entropy = cross entropy
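
The identity can be checked numerically. The sketch below uses its own small helpers (`entropy_bits`, `cross_entropy_bits` and `kl_bits` are names chosen here so they do not clash with the notebook's functions) and two arbitrary example distributions.

```python
import math

def entropy_bits(p):
    return sum(-pi * math.log2(pi) for pi in p if pi > 0)

def cross_entropy_bits(p, q):
    return sum(-pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_bits(p, q):
    return sum(pi * (math.log2(pi) - math.log2(qi)) for pi, qi in zip(p, q) if pi > 0)

true_dist = [0.5, 0.5]
predicted = [0.1, 0.9]

#  cross entropy = entropy + KLD
print(cross_entropy_bits(true_dist, predicted))
print(entropy_bits(true_dist) + kl_bits(true_dist, predicted))

#  if true = predicted, the KLD is zero and cross entropy = entropy
print(cross_entropy_bits(true_dist, true_dist), entropy_bits(true_dist))
```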

In [None]:
def cross_entropy(p, q):
    epsilon = 1e-16
    return sum([-tr * math.log(est + epsilon, 2) for tr, est in zip(p, q)])

In [None]:
cross_entropy([0.1, 0.9], [0.1, 0.9])

In [None]:
cross_entropy([0.5, 0.5], [0.1, 0.9])

## Minimizing cross entropy

A common operation in modern ML is minimizing cross entropy between a one hot encoded label & a softmax.

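The **softmax** is a soft version of one hot encoding: it turns a vector of scores into a probability distribution that puts most of its mass on the largest score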

In [None]:
def softmax(X):
    #  subtract the max before exponentiating to avoid overflow - the result is unchanged
    exps = np.exp(X - np.max(X))
    return exps / np.sum(exps)

def normalize(x):
    return (x - np.min(x)) / (np.max(x) - np.min(x))

In [None]:
feature_map = np.random.normal(size=5)
feature_map

In [None]:
f, a = plt.subplots(nrows=2, sharey=True)
a[0].bar(np.arange(len(feature_map)), normalize(feature_map), label='normalized')
a[1].bar(np.arange(len(feature_map)), softmax(feature_map), label='softmax')
_ = f.legend()

In [None]:
label = np.zeros(feature_map.shape[0])
label[1] = 1

label

In [None]:
#  cross_entropy(p, q) expects the true distribution first, then the estimate
cross_entropy(label, softmax(feature_map))
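
With a one hot label, every term of the cross entropy sum vanishes except the one for the true class, so the loss is just the negative log of the probability predicted for that class. A self-contained sketch, where `scores` stands in for `feature_map` and the seed and class index are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
scores = rng.normal(size=5)  # stand-in for feature_map
true_class = 1

# numerically stable softmax
shifted = scores - scores.max()
predicted = np.exp(shifted) / np.exp(shifted).sum()

one_hot = np.zeros_like(predicted)
one_hot[true_class] = 1.0

# full cross entropy sum vs. the single surviving term
full_sum = -np.sum(one_hot * np.log2(predicted))
shortcut = -np.log2(predicted[true_class])
print(full_sum, shortcut)
```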

We will now go through two examples to demonstrate the concept of entropy further.

## The weather example

[A Short Introduction to Entropy, Cross-Entropy and KL-Divergence - Aurélien Géron](https://www.youtube.com/watch?v=ErfnhcEV1O8)

In [None]:
#  weather in two states - equally likely
weather = ['sunny', 'rainy']
_, probs = make_pmf(weather)

entropy(probs, 2)

In [None]:
message_size = len('rainy') * 8
message_size

In [None]:
#  uncertainty reduction
entropy(make_pmf(['sunny', 'rainy'])[1]) - entropy(make_pmf(['rainy'])[1])

In [None]:
#  weather in eight states - equally likely
weather = np.arange(8)
weather

In [None]:
#  uncertainty reduction
entropy(make_pmf(weather)[1]) - entropy(make_pmf(['rainy'])[1])

In [None]:
#  what about different probabilities?
weather = ['sunny', 'sunny', 'sunny', 'rainy']
entropy(make_pmf(weather)[1], 2)

In [None]:
#  amount of information you gain if you find out it is rainy
-math.log(0.25, 2)

In [None]:
#  amount of information you gain if you find out it is sunny
-math.log(0.75, 2)

In [None]:
#  average information when you learn the weather
0.25 * -math.log(0.25, 2) + 0.75 * -math.log(0.75, 2)

Let's encode our weather using three bits - we have an average message length of three bits:

In [None]:
weather = [
    '000', '001', '010', '011', '100', '101', '110', '111'
]
#  entropy expects probabilities, not code strings - convert with make_pmf first
entropy(make_pmf(weather)[1], 2)

What if our weather distribution changes:

In [None]:
probs = [
    0.35, 0.35, 0.1, 0.1, 0.04, 0.04, 0.01, 0.01
]
#  use a new name so the entropy function defined earlier is not overwritten
weather_entropy = sum([-prob * math.log(prob, 2) for prob in probs])
weather_entropy

So our weather station is sending an average of 3 bits per message, when the weather's entropy is only 2.23 bits (i.e. we only get 2.23 useful bits)

Let's encode our weather more smartly, giving shorter codes to more likely weather:

In [None]:
weather = [
    '00', '01', '100', '101', '1100', '1101', '11100', '11101'
]

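#  average message length in bits - named so the cross_entropy function is not overwritten
avg_message_length = np.sum([len(msg) * prob for msg, prob in zip(weather, probs)])
avg_message_length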
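
The code above was written by hand. As a point of comparison, and as a technique not used in the original, Huffman coding builds an optimal prefix code for a known distribution. The sketch below only computes code lengths (not the code words themselves); `huffman_lengths` and `station_probs` are names chosen for this example.

```python
import heapq
import math

def huffman_lengths(probs):
    """Code length per symbol for a Huffman code built from probs."""
    # heap entries: (weight, unique tie breaker, symbols in this subtree)
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    tie_breaker = len(probs)
    while len(heap) > 1:
        w1, _, symbols1 = heapq.heappop(heap)
        w2, _, symbols2 = heapq.heappop(heap)
        # merging two subtrees pushes every symbol in them one level deeper
        for symbol in symbols1 + symbols2:
            lengths[symbol] += 1
        heapq.heappush(heap, (w1 + w2, tie_breaker, symbols1 + symbols2))
        tie_breaker += 1
    return lengths

station_probs = [0.35, 0.35, 0.1, 0.1, 0.04, 0.04, 0.01, 0.01]
lengths = huffman_lengths(station_probs)
huffman_avg = sum(n * p for n, p in zip(lengths, station_probs))
entropy_bits = sum(-p * math.log2(p) for p in station_probs)

print(lengths)
print(huffman_avg, entropy_bits)
```

The average length of any prefix code cannot go below the entropy, and a Huffman code is guaranteed to stay within one bit of it.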

Let's imagine the weather is reversed:

In [None]:
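#  same code, reversed probabilities - named so the cross_entropy function is not overwritten
reversed_message_length = np.sum([len(msg) * prob for msg, prob in zip(weather, reversed(probs))])
reversed_message_length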
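
Why an average message length can be read as a cross entropy: a code word of length $n$ behaves like a predicted probability of $2^{-n}$. The sketch below makes that explicit and also checks that the code is prefix free. The variable names are chosen for this example, and the implied distribution is left unnormalized (it sums to less than one because one code word slot is unused).

```python
import math

codes = ['00', '01', '100', '101', '1100', '1101', '11100', '11101']
station_probs = [0.35, 0.35, 0.1, 0.1, 0.04, 0.04, 0.01, 0.01]

# a decodable stream needs a prefix free code: no code word starts another
prefix_free = not any(a != b and b.startswith(a) for a in codes for b in codes)

# distribution implied by the code lengths
implied = [2 ** -len(code) for code in codes]
kraft_sum = sum(implied)

def cross_entropy_bits(p, q):
    return sum(-pi * math.log2(qi) for pi, qi in zip(p, q))

matched = cross_entropy_bits(station_probs, implied)
mismatched = cross_entropy_bits(list(reversed(station_probs)), implied)

print(prefix_free, kraft_sum)
print(matched, mismatched)
```

When the weather is reversed, the long code words are used most often, so the same code costs far more bits per message.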

## The bucket example

https://www.youtube.com/watch?v=9r7FIXEAGvs

Let's imagine we have three buckets:


In [None]:
b1 = ['red'] * 4
b2 = ['red'] * 3 + ['green']
b3 = ['red'] * 2 + ['green'] * 2

One way to look at entropy is to consider how many different ways we can rearrange this set:

In [None]:
from itertools import permutations 

set(permutations(b1))

In [None]:
set(permutations(b2))

In [None]:
set(permutations(b3))
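
Enumerating permutations gets slow quickly as the buckets grow. The number of distinct arrangements can also be counted directly as $n! / (n_1! \, n_2! \cdots)$, where $n_i$ is the number of balls of each color. A sketch with a hypothetical helper `n_arrangements`, checked against the enumeration:

```python
from collections import Counter
from itertools import permutations
from math import factorial

def n_arrangements(items):
    """Number of distinct orderings of items, counting equal items as identical."""
    total = factorial(len(items))
    for count in Counter(items).values():
        total //= factorial(count)
    return total

buckets = {
    'b1': ['red'] * 4,
    'b2': ['red'] * 3 + ['green'],
    'b3': ['red'] * 2 + ['green'] * 2,
}

for name, bucket in buckets.items():
    print(name, n_arrangements(bucket), len(set(permutations(bucket))))
```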

We can be more precise if we think in terms of information.

How much do we know about what ball we will pick from each bucket?
- b1 = high knowledge
- b2 = medium knowledge
- b3 = low knowledge

Let's imagine we play a game where we win if we pick balls out in a given specific order (i.e. as the lists are defined above).  

What is the probability of winning for each of our buckets?

If we sample with replacement, we are sampling independently
- therefore the probability is a product of all the events

In [None]:
b1_odds = 1 * 1 * 1 * 1

b2_odds = 0.75 * 0.75 * 0.75 * 0.25

b3_odds = 0.5 ** 4

But we don't like the products
- if we have many probabilities, the product becomes very small
- changing one number can change the entire product by an unknown amount

Use a log to change the product into a sum

$$ \log(ab) = \log(a) + \log(b) $$
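
A short sketch of both points: the log of the product equals the sum of the logs, and the sum stays usable when the product itself underflows. The 2000 draws are an arbitrary number chosen to force the underflow.

```python
import math

# bucket 2: three reds then one green
draws = [0.75, 0.75, 0.75, 0.25]
product = math.prod(draws)
log_sum = sum(math.log2(p) for p in draws)
print(math.log2(product), log_sum)

# with many probabilities the product underflows to zero, the log sum does not
many_draws = [0.5] * 2000
tiny_product = math.prod(many_draws)
many_log_sum = sum(math.log2(p) for p in many_draws)
print(tiny_product, many_log_sum)
```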

In [None]:
from math import log

In [None]:
-(log(1, 2) * 4) / 4

In [None]:
-(log(0.75, 2) * 3 + log(0.25, 2)) / 4

In [None]:
(-log(0.5, 2) * 4) / 4

Another case of 5 red balls, 3 green:

In [None]:
- (5/8 * log(5/8, 2) + 3/8 * log(3/8, 2))

What if there are more classes?
- here we can connect entropy with information gain

In [None]:
s1 = 'a' * 8
s2 = 'a' * 4 + 'b' * 2 + 'c' + 'd'
s3 = 'a' * 2 + 'b' * 2 + 'c' * 2 + 'd' * 2

How do we order these in terms of how easy it is to guess a random letter?

For the first sequence:

In [None]:
- (1 * log(1, 2))

For the second sequence:

In [None]:
-(1/2 * log(1/2, 2) + 1/4 * log(1/4, 2) + 1/8 * log(1/8, 2) + 1/8 * log(1/8, 2))

For the third sequence:

In [None]:
- log(1/4, 2)
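
The three hand calculations follow the same recipe: count each letter, turn counts into probabilities, and sum $-p \log_2 p$. A sketch that does this for any sequence, with `sequence_entropy` as a name chosen for this example:

```python
from collections import Counter
from math import log2

def sequence_entropy(seq):
    """Entropy in bits of drawing one item uniformly at random from seq."""
    total = len(seq)
    return sum(-(n / total) * log2(n / total) for n in Counter(seq).values())

sequences = {
    's1': 'a' * 8,
    's2': 'a' * 4 + 'b' * 2 + 'c' + 'd',
    's3': 'a' * 2 + 'b' * 2 + 'c' * 2 + 'd' * 2,
}

for name, seq in sequences.items():
    print(name, sequence_entropy(seq))
```

The lower the entropy, the easier it is to guess a random letter, so the order from easiest to hardest is s1, s2, s3.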

## Quiz

For s1 - we need no questions at all, because every letter is the same -> 0 entropy

For s3 - how can we ask (on average) 2 questions?

For s2 - how can we ask 1.75 questions?

The entropy is the average number of questions we need to ask
- if we use a smart series of questions
- height = num of questions we need to ask to figure out letter
- if height = k, we have 2^k letters on the bottom
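
One possible series of questions for s2, written out as a sketch. The question order is an assumption made for this example: ask about the most likely letter first, so likely letters sit near the top of the tree.

```python
# Hypothetical question tree for s2 = 'aaaabbcd':
#   Q1 "is it a?"  yes -> a
#   Q2 "is it b?"  yes -> b
#   Q3 "is it c?"  yes -> c, no -> d
questions_needed = {'a': 1, 'b': 2, 'c': 3, 'd': 3}

s2 = 'a' * 4 + 'b' * 2 + 'c' + 'd'
average_questions = sum(questions_needed[letter] for letter in s2) / len(s2)
print(average_questions)

# For s3 all four letters are equally likely, so a balanced tree is best:
# two questions for every letter.
s3 = 'a' * 2 + 'b' * 2 + 'c' * 2 + 'd' * 2
balanced_questions = {'a': 2, 'b': 2, 'c': 2, 'd': 2}
average_questions_s3 = sum(balanced_questions[letter] for letter in s3) / len(s3)
print(average_questions_s3)
```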

![](assets/log.png)