<a href="https://colab.research.google.com/github/AoShuang92/PhD_tutorial/blob/main/entropy_uncertainty.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Entropy in Machine Learning
Entropy in physics is a measure of randomness in an isolated system. In machine learning it plays a very similar role: it is still a measure of randomness, but what you measure is the disorder of the information processed in your ML project, for example how spread out a predicted probability distribution is.


Let’s use a simple example: flipping a fair coin. There are two possible outcomes, and nothing about the flip itself tells you which one you will get. Whatever you do, it’s 50-50. In this situation entropy is at its maximum: the outcome is as hard to predict as it can be.
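
To put numbers on the coin example, here is a minimal sketch that computes the base-2 entropy of a coin for a few head probabilities. The helper name `binary_entropy_bits` is made up for this illustration. A fair coin gives 1 bit, a coin that lands heads 90% of the time gives roughly 0.47 bits, and a coin that always lands heads gives 0.

```python
import numpy as np

def binary_entropy_bits(p_heads):
    """Entropy in bits of a coin that lands heads with probability p_heads."""
    p = np.array([p_heads, 1.0 - p_heads])
    p = p[p > 0]  # drop zero-probability outcomes: 0 * log2(0) is taken as 0
    return float(-np.sum(p * np.log2(p)))

fair = binary_entropy_bits(0.5)     # hardest to predict
biased = binary_entropy_bits(0.9)   # easier to predict
certain = binary_entropy_bits(1.0)  # no uncertainty at all
print(fair, biased, certain)
```

The more lopsided the coin, the lower the entropy, which matches the intuition that a predictable outcome carries little information.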


https://stats.stackexchange.com/questions/25827/how-does-one-measure-the-non-uniformity-of-a-distribution

# Entropy with log base 2:

In [1]:
import numpy as np
from scipy.stats import entropy
from scipy.special import kl_div
from scipy.special import rel_entr

def get_shannon_entropy(A):
    pA = A / A.sum()
    se = -np.sum(pA*np.log2(pA))
    return se

def softmax(x):
    f_x = np.exp(x) / np.sum(np.exp(x))
    return f_x

rng = np.random.default_rng(12345)
dist = []
std = []
se = []
kl = []
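for _ in range(10):
    # random 10-class distribution: softmax over uniform random values
    rints = softmax(rng.random((10, 1)))
    dist.append(rints)
    std.append(np.std(rints))
    se.append(entropy(rints, base=2))  # Shannon entropy in bits
    # KL divergence from the uniform distribution (natural log)
    kl.append(np.sum(kl_div(np.squeeze(rints), np.ones(10)/10)))
std = np.array(std)
# flip the sign of the entropy so that, like std and kl, a smaller value
# means "closer to uniform"; the constant 1 does not change the argsort order
se = 1 - np.array(se).squeeze()
kl = np.array(kl)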
print(std.shape, se.shape, kl.shape)

print('std            :',np.argsort(std))
print('shannon entropy:',np.argsort(se))
print('kl_div         :',np.argsort(kl))

(10,) (10,) (10,)
std            : [5 3 6 0 4 2 7 8 1 9]
shannon entropy: [5 3 6 2 0 4 8 7 1 9]
kl_div         : [5 3 6 2 0 4 8 7 1 9]
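
The entropy and KL rankings above are identical, and that is not a coincidence. For a distribution p over k classes, the KL divergence from the uniform distribution equals ln(k) minus the natural-log entropy of p, so sorting by one is the same as sorting by the other. The standard deviation is not a monotone function of entropy, which is why its ranking can differ in a few positions. A small check of the identity, using an arbitrary seed chosen only for this sketch:

```python
import numpy as np
from scipy.stats import entropy
from scipy.special import kl_div

rng = np.random.default_rng(0)  # arbitrary seed, for illustration only
k = 10
logits = rng.random(k)
p = np.exp(logits) / np.exp(logits).sum()  # softmax, sums to 1
uniform = np.ones(k) / k

h_nats = entropy(p)                   # entropy with natural log
kl_nats = np.sum(kl_div(p, uniform))  # KL(p || uniform) with natural log

# The two printed numbers should agree up to floating-point error
print(kl_nats, np.log(k) - h_nats)
```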


# Entropy with natural log, base e (numpy vs scipy vs tensor)

In [2]:
#from numpy
from scipy.stats import entropy
import numpy as np
def get_se_v1(p):
    logp = np.log(p)
    entropy1 = np.sum(-p*logp)
    return entropy1

p = np.array([0.1, 0.2, 0.4, 0.3])
print('numpy v1:', get_se_v1(p))

#scipy
print('scipy v2:',entropy(p))

#from tensor
import torch
from torch.distributions import Categorical
p_tensor = torch.Tensor(p)
entropy2 = Categorical(p_tensor).entropy()
print('tensor v3:', entropy2.item())

numpy v1: 1.2798542258336676
scipy v2: 1.2798542258336676
tensor v3: 1.2798542976379395


# Entropy range

In [3]:
p0 = torch.tensor([0.2, 0.2, 0.2, 0.2, 0.2])
p1 = torch.tensor([0.0, 0.0, 1, 0.0, 0.0])
p2 = torch.tensor([0.1, 0.0, 0.9, 0.0, 0.0])
p3 = torch.tensor([0.1, 0.2, 0.4, 0.2, 0.1])

se0 = Categorical(p0).entropy()
se1 = Categorical(p1).entropy()
se2 = Categorical(p2).entropy()
se3 = Categorical(p3).entropy()
print('se0:%0.4f'%se0.item(), '\nse1:%0.4f'%se1.item(), '\nse2:%0.4f'%se2.item(), '\nse3:%0.4f'% se3.item())

se0:1.6094 
se1:0.0000 
se2:0.3251 
se3:1.4708
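
Note that `p1` and `p2` contain exact zeros, and `Categorical` still returns finite values. A plain NumPy `-sum(p * log(p))` does not: `0 * log(0)` evaluates to `nan`. One way around this, sketched here with `scipy.special.entr` (which computes `-x * log(x)` elementwise and returns 0 at `x = 0`):

```python
import numpy as np
from scipy.special import entr

p1 = np.array([0.0, 0.0, 1.0, 0.0, 0.0])
p2 = np.array([0.1, 0.0, 0.9, 0.0, 0.0])

# Naive formula: log(0) is -inf and 0 * -inf is nan, so the sum is nan
with np.errstate(divide="ignore", invalid="ignore"):
    naive = np.sum(-p1 * np.log(p1))

# entr treats 0 * log(0) as 0, so the sums stay finite (natural log)
safe1 = np.sum(entr(p1))
safe2 = np.sum(entr(p2))
print(naive, safe1, safe2)
```

`scipy.stats.entropy` handles zeros the same way, so it is also a safe choice when the probabilities are NumPy arrays.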


# Why am I getting information entropy greater than 1?

Entropy can be greater than 1. Whether it is depends on the number of classes (binary or multiclass) and on the log base (2 or e).

With log base 2, entropy ranges from 0 to 1 for binary classification problems and from 0 to log2(k) in general, where k is the number of classes you have. With the natural log the maximum is ln(k) instead. Entropy measures the "information" or "uncertainty" of a random variable. When you are using base 2, it is measured in bits, and there can be more than one bit of information in a variable.
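
If you want a score that always lies between 0 and 1 regardless of the number of classes, one common option is to divide the entropy by its maximum, log(k). The sketch below assumes k is at least 2, and the helper name `normalized_entropy` is made up for this illustration. The log base cancels in the ratio, so it does not matter which one you use.

```python
import numpy as np
from scipy.stats import entropy

def normalized_entropy(p):
    """Entropy of p divided by its maximum log(k); lies in [0, 1] for k >= 2."""
    p = np.asarray(p, dtype=float)
    k = p.size
    return float(entropy(p) / np.log(k))  # natural log in both terms

binary_uniform = normalized_entropy([0.5, 0.5])
five_uniform = normalized_entropy([0.2, 0.2, 0.2, 0.2, 0.2])
four_skewed = normalized_entropy([0.1, 0.2, 0.4, 0.3])
print(binary_uniform, five_uniform, four_skewed)
```

Both uniform distributions score 1 (up to floating-point error), and the skewed one scores below 1.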

Binary Category: Natural log (base e) Vs. Log base 2:

In [4]:
def get_se_ln(p):
    # np.log is the natural logarithm, so the result is in nats
    logp = np.log(p)
    entropy1 = np.sum(-p*logp)
    return entropy1

def get_se_log2(p):
    # np.log2 gives the result in bits
    logp = np.log2(p)
    entropy1 = np.sum(-p*logp)
    return entropy1

p = np.array([0.5, 0.5])
get_se_ln(p), get_se_log2(p)

(0.6931471805599453, 1.0)

Multiclass Category: Natural log (base e) Vs. Log base 2:

In [5]:
p = np.array([0.2, 0.2, 0.2, 0.2, 0.2])
get_se_ln(p), get_se_log2(p)

(1.6094379124341005, 2.321928094887362)
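
The two values in each pair above differ only by a constant factor: dividing the natural-log entropy by ln(b) gives the entropy in base b. A short check using the `base` argument of `scipy.stats.entropy`, which also shows what a true base-10 entropy looks like for the same distribution:

```python
import numpy as np
from scipy.stats import entropy

p = np.array([0.2, 0.2, 0.2, 0.2, 0.2])

h_nats = entropy(p)             # natural log (the scipy default)
h_bits = entropy(p, base=2)     # log base 2
h_base10 = entropy(p, base=10)  # log base 10

# Changing the base only rescales the value
print(h_nats / np.log(2), h_bits)
print(h_nats / np.log(10), h_base10)
```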

# Therefore, log base 2 keeps entropy within 0-1 only for binary classification.

# Entropy on multiple tensors at a time (natural log)

In [5]:
p = torch.stack([p0, p1, p2, p3])
se = Categorical(p).entropy()
print('se0:%0.4f'%se[0].item(), '\nse1:%0.4f'%se[1].item(), '\nse2:%0.4f'%se[2].item(), '\nse3:%0.4f'% se[3].item())

se0:1.6094 
se1:0.0000 
se2:0.3251 
se3:1.4708


In [6]:
p0 = torch.tensor([0.2, 0.8])
p1 = torch.tensor([0.8, 0.2])
p2 = torch.tensor([0.7, 0.3])
p3 = torch.tensor([0.5, 0.5])
p = torch.stack([p0, p1, p2, p3])
e = Categorical(p).entropy()

c1 = e[1] * (1-0.8) ** (-0.8) # [0.8, 0.2]
c2 = e[2] * (1-0.7) ** (-0.7) # [0.7, 0.3]
c3 = e[3] * (1-0.5) ** (-0.5) # [0.5, 0.5]
c1, c2, c3

(tensor(1.8134), tensor(1.4189), tensor(0.9803))

In [8]:
c0 = e[1] * (1-0.8) ** (-(1-0.8)) # [0.2, 0.8] gt, entropy small
c1 = e[1] * (1-0.8) ** (-(1-0.2)) # [0.8, 0.2] wrong pred, entropy big
c0, c1, e[1]

(tensor(0.6904), tensor(1.8134), tensor(0.5004))

In [22]:
p0 = torch.tensor([0.2, 0.2, 0.1, 0.1, 0.4]) #s0. e0
p1 = torch.tensor([0.2, 0.2, 0.1, 0.4, 0.1]) #s3. e1
p2 = torch.tensor([0.2, 0.2, 0.1, 0.3, 0.2]) #s2. e2
p3 = torch.tensor([0.2, 0.2, 0.2, 0.2, 0.2]) #s1  e3
p = torch.stack([p0, p1, p2, p3])
e = Categorical(p).entropy()

c0 = e[0] * (1-0.4) ** (-(1-0.4)) 
c1 = e[1] * (1-0.4) ** (-(1-0.1)) 
c2 = e[2] * (1-0.3) ** (-(1-0.2)) 
c3 = e[3] * (1-0.2) ** (-(1-0.2)) 
e[0], c0, c1, c2, c3

(tensor(1.4708),
 tensor(1.9983),
 tensor(2.3293),
 tensor(2.0713),
 tensor(1.9240))

In [87]:
p0 = torch.tensor([0.2, 0.2, 0.1, 0.1, 0.4]) #s0. e0
p1 = torch.tensor([0.2, 0.2, 0.1, 0.4, 0.1]) #s3. e1
p2 = torch.tensor([0.2, 0.2, 0.1, 0.3, 0.2]) #s2. e2
p3 = torch.tensor([0.2, 0.2, 0.2, 0.2, 0.2]) #s1  e3
p = torch.stack([p0, p1, p2, p3])
e = Categorical(p).entropy()

c0 = e[0] * (1-0.4) ** (-0.4) # p0, max prob 0.4
c1 = e[1] * (1-0.4) ** (-0.4) # p1, max prob 0.4
c2 = e[2] * (1-0.3) ** (-0.3) # p2, max prob 0.3
c3 = e[3] * (1-0.2) ** (-0.2) # p3, max prob 0.2
e[0], c0, c1, c2, c3

(tensor(1.4708),
 tensor(1.8042),
 tensor(1.8042),
 tensor(1.7330),
 tensor(1.6829))

In [104]:
p0 = torch.tensor([0.2, 0.2, .19, 0.01, 0.4]) #s0. e0
p1 = torch.tensor([0.025, 0.025, 0.3, 0.3, 0.35]) #s3. e1

p = torch.stack([p0, p1])
e = Categorical(p).entropy()
c0 = e[0] * (1-0.4) ** (-0.4) # p0, max prob 0.4
c1 = e[1] * (1-0.35) ** (-0.35) # p1, max prob 0.35
e, p1.sum(), c0, c1

(tensor([1.3719, 1.2743]), tensor(1.), tensor(1.6829), tensor(1.4816))

In [9]:
p0 = torch.tensor([0.2, 0.2, .19, 0.01, 0.4]) #s0. e0
p1 = torch.tensor([0.025, 0.025, 0.3, 0.3, 0.35]) #s3. e1

p = torch.stack([p0, p1])
e = Categorical(p).entropy()
c0 = e[0] * (1-0.4) ** (-(1-0.4)) # p0, max prob 0.4
c1 = e[1] * (1-0.35) ** (-(1-0.35)) # p1, max prob 0.35
e, p1.sum(), c0, c1

(tensor([1.3719, 1.2743]), tensor(1.), tensor(1.8639), tensor(1.6860))

In [73]:
c1 = 0.6109 * (1-0.7) ** (-0.7) # [0.7, 0.3]
c2 = 0.5004 * (1-0.8) ** (-0.8) # [0.8, 0.2]
c3 = 0.6931 * (1-0.5) ** (-0.5) # [0.5, 0.5]
c1, c2, c3

(1.4190093165089037, 1.8133987185215945, 0.9801914200807923)

In [70]:
0.5004 * (1-0.8) ** (-0.8) # [0.8, 0.2]

1.8133987185215945

In [71]:
0.6931 * (1-0.5) ** (-0.5) # [0.5, 0.5]

0.9801914200807923

In [26]:
np.log(0.7)* (-0.7) + np.log(0.3)*(-0.3)

0.6108643020548935

In [49]:
np.log(0.7)* (-0.7),  -np.log(0.7)* (1-0.7)**(0.5)

(0.2496724607571127, 0.1953589124921343)

In [50]:
np.log(0.3)*(-0.3), -np.log(0.3)*(1-0.3)**(0.5)

(0.3611918412977808, 1.007315918413643)

In [51]:
np.log(0.7)* (-0.7) + np.log(0.3)*(-0.3)

0.6108643020548935

In [52]:
-np.log(0.7)* (1-0.7)**(0.5) + (-np.log(0.3)*(1-0.3)**(0.5))

1.2026748309057773