# Bayesian Machine Learning for Health Data


### Task 2: Briefly explain and implement from scratch the following functions: i) cross-entropy; ii) entropy; iii) mutual information; iv) conditional entropy; v) KL divergence. Take appropriate toy data/distributions as examples and explain the insights gained from calculating these quantities.


###### reference: https://tungmphung.com/information-theory-concepts-entropy-mutual-information-kl-divergence-and-more/


#### a) Entropy
Entropy measures the degree of uncertainty in the outcome of a random event. If all outcomes have comparable probabilities, we cannot tell in advance which one is likely to occur, so the entropy is high.
But suppose some outcomes are much more likely than the others. In that case we can predict with some confidence that the outcome will be one of the high-probability ones, so the entropy is lower.

![entropy](https://tungmphung.com/wp-content/ql-cache/quicklatex.com-6e5fe46a0e1159891d09684d0fe9d8b7_l3.svg)

In [None]:
import math

def entropy(p):
    # Shannon entropy in nats (natural log); terms with p[i] == 0 contribute nothing
    e = 0
    for i in range(len(p)):
        if p[i] > 0:
            e -= p[i] * math.log(p[i])
    return e

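Example:
Suppose we have a biased coin that has a probability of 0.8 of landing heads and 0.2 of landing tails. Using the natural logarithm, as the function above does, the entropy of the distribution is:

H(p) = - (0.8 * log(0.8) + 0.2 * log(0.2)) = 0.500

Now take the probability to be the same (0.5) for both outcomes. Then H(p) = - (0.5 * log(0.5) + 0.5 * log(0.5)) = 0.693

Insights: A higher entropy value indicates more randomness or uncertainty in the distribution. The fair coin (0.693) is harder to predict than the biased coin (0.500), and for two outcomes the uniform distribution gives the largest possible entropy, log(2).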
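As a quick check on the coin example, the sketch below recomputes both entropies and also reports the fair coin in bits. The helper name `entropy_in_base` and its `base` parameter are illustrative assumptions, not part of the task's `entropy` function.

```python
import math

def entropy_in_base(p, base=math.e):
    """Shannon entropy of a discrete distribution p, in units set by `base`
    (base e gives nats, base 2 gives bits). Hypothetical helper."""
    return -sum(pi * math.log(pi, base) for pi in p if pi > 0)

biased = [0.8, 0.2]   # biased coin from the example
fair = [0.5, 0.5]     # fair coin from the example

h_biased = entropy_in_base(biased)       # nats
h_fair = entropy_in_base(fair)           # nats, equals ln 2
h_fair_bits = entropy_in_base(fair, 2)   # bits

print(round(h_biased, 3), round(h_fair, 3), round(h_fair_bits, 3))
```

The choice of logarithm base only rescales the result, so the ordering between the two coins is the same in nats and in bits.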

#### b) Cross-entropy:
Cross-entropy is a measure of the difference between two probability distributions, typically the predicted probability distribution and the true probability distribution. It is commonly used in machine learning for evaluating classification models. The formula for cross-entropy between two probability distributions p and q is:

![cross-entropy](https://tungmphung.com/wp-content/ql-cache/quicklatex.com-e5165ec6015917d7491e82de9b96959f_l3.svg)

In [None]:
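import math

def crossEntropy(a, b):
    # a: true distribution, b: predicted distribution; natural log, so the result is in nats
    hE = 0
    assert len(a) == len(b)
    for i in range(len(a)):
        if a[i] > 0:  # terms with a[i] == 0 contribute nothing
            hE -= a[i] * math.log(b[i])
    return hE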

Example:
Suppose we have a binary classification problem with two classes A and B. We have a model that predicts the probability of each class given an input. We also have a true distribution over the classes. Let's say the true distribution is p = [0.6, 0.4] and the predicted distribution by the model is q = [0.8, 0.2]. Then the cross-entropy between p and q can be calculated as follows:

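H(p,q) = - (0.6 * log(0.8) + 0.4 * log(0.2)) = 0.778 (using the natural logarithm)

Insights: A lower value of cross-entropy indicates a better match between the true distribution and the predicted distribution. The smallest possible value is reached when q = p, where the cross-entropy equals the entropy of p (0.673 for p = [0.6, 0.4]).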
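To make the "lower is better" reading concrete, the following sketch compares the cross-entropy of the example's prediction with the value obtained when the prediction matches the true distribution exactly. The function name `cross_entropy_nats` is a hypothetical stand-in used so the snippet runs on its own.

```python
import math

def cross_entropy_nats(p, q):
    """H(p, q) = -sum_i p_i * ln(q_i); terms with p_i == 0 contribute nothing.
    Assumes q_i > 0 wherever p_i > 0."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.6, 0.4]   # true distribution from the example
q = [0.8, 0.2]   # model prediction from the example

h_pq = cross_entropy_nats(p, q)   # cost of predicting q when the truth is p
h_pp = cross_entropy_nats(p, p)   # perfect prediction: equals the entropy H(p)

print(round(h_pq, 3), round(h_pp, 3))
```

The gap `h_pq - h_pp` is never negative, which is why cross-entropy can serve as a loss that is minimized by matching the true distribution.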

#### c) Mutual Information:

Mutual information is a measure of the amount of information that one random variable contains about another random variable. It is related to the conditional entropy: the mutual information is the reduction in uncertainty about X once we know Y. Given two random variables X and Y, the mutual information between X and Y is defined as:

![mutual-information](https://tungmphung.com/wp-content/ql-cache/quicklatex.com-fd3b6af9f9c9822135108e3906181eaf_l3.svg)

In [None]:
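import numpy as np

def mutual_information(p_joint, p_X, p_Y):
    # Rows of p_joint index X, columns index Y; assumes no zero cells in p_joint
    p_joint = np.asarray(p_joint, dtype=float)
    p_X_Y = p_joint / np.asarray(p_Y, dtype=float)[np.newaxis, :]  # p(x|y)
    return entropy(p_X) + np.sum(p_joint * np.log(p_X_Y))  # H(X) - H(X|Y)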


Example:

Suppose we have two random variables X and Y, and their joint probability distribution is:

p_joint = [[0.2, 0.3],
           [0.1, 0.4]]

The marginal probability distributions of X (row sums of p_joint) and Y (column sums of p_joint) are:

p_X = [0.5, 0.5]
p_Y = [0.3, 0.7]

Then, the mutual information between X and Y (using the natural logarithm) is:

I(X;Y) = H(X) - H(X|Y) = (-[(0.5 * log(0.5)) + (0.5 * log(0.5))]) - (-[(0.2log(0.2/0.3)) + (0.3log(0.3/0.7)) + (0.1log(0.1/0.3)) + (0.4log(0.4/0.7))]) = 0.693 - 0.669 = 0.024

Insight:

In this example, X and Y are not independent, as their mutual information is greater than zero. They are also far from perfectly dependent, as their mutual information is well below the entropy of either X or Y, so knowing one variable removes only a small part of the uncertainty about the other.
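The link between mutual information and independence can be checked numerically. The sketch below uses the equivalent form I(X;Y) = sum over x,y of p(x,y) log(p(x,y) / (p(x) p(y))) and derives both marginals from the joint table itself. It assumes rows index X and columns index Y, and `mutual_information_from_joint` is a hypothetical helper name.

```python
import numpy as np

def mutual_information_from_joint(p_joint):
    """I(X;Y) in nats, computed from the joint table alone.
    Assumed orientation: rows index X, columns index Y."""
    p_joint = np.asarray(p_joint, dtype=float)
    p_x = p_joint.sum(axis=1)           # marginal of X (row sums)
    p_y = p_joint.sum(axis=0)           # marginal of Y (column sums)
    independent = np.outer(p_x, p_y)    # what the joint would be under independence
    mask = p_joint > 0                  # zero cells contribute nothing
    return float(np.sum(p_joint[mask] * np.log(p_joint[mask] / independent[mask])))

dependent = [[0.2, 0.3], [0.1, 0.4]]         # joint table from the example
product = np.outer([0.5, 0.5], [0.3, 0.7])   # same marginals, but independent

mi_dependent = mutual_information_from_joint(dependent)
mi_product = mutual_information_from_joint(product)

print(mi_dependent, mi_product)
```

Deriving the marginals inside the function avoids passing marginals that are inconsistent with the joint table, at the cost of a less flexible signature.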

#### d) Conditional Entropy:

Conditional entropy is a measure of the amount of uncertainty or randomness in a random variable X given another random variable Y. Given two random variables X and Y, the conditional entropy of X given Y is defined as:


![cond-entropy](https://tungmphung.com/wp-content/ql-cache/quicklatex.com-49c58e49cea026276221bb464762ca43_l3.svg)

In [6]:
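import numpy as np

def conditional_Entropy(p_joint, p_y):
    # Rows of p_joint index X, columns index Y; assumes no zero cells in p_joint
    p_joint, p_y = np.asarray(p_joint, dtype=float), np.asarray(p_y, dtype=float)
    p_x_given_y = p_joint / p_y[np.newaxis, :]  # divide each column by p(y)
    return np.sum(-p_y * np.sum(p_x_given_y * np.log(p_x_given_y), axis=0))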


Example:

Suppose we have two random variables X and Y, and their joint probability distribution is:

p_joint = [[0.2, 0.3],
           [0.1, 0.4]]

The marginal probability distribution of Y (the column sums of p_joint) is:

p_y = [0.3, 0.7]

Then, the conditional entropy of X given Y (using the natural logarithm) is:

H(X|Y) = -[(0.3 * [((0.2/0.3)log(0.2/0.3)) + ((0.1/0.3)log(0.1/0.3))]) + (0.7 * [((0.3/0.7)log(0.3/0.7)) + ((0.4/0.7)log(0.4/0.7))])] = 0.669

Insight:
Each inner bracket is the entropy of X for one fixed value of Y, and the outer weights p(y) average these entropies over Y. In this example, the conditional entropy of X given Y (0.669) is smaller than the entropy of X (0.693 for the row sums [0.5, 0.5]), which indicates that knowing Y reduces the uncertainty in X, although only slightly.
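One way to cross-check a conditional entropy is the chain rule H(X|Y) = H(X,Y) - H(Y), which needs only plain entropies. The sketch below applies it to the example's joint table, assuming rows index X and columns index Y. `entropy_nats` is a hypothetical helper written for this snippet.

```python
import numpy as np

def entropy_nats(p):
    """Entropy in nats of any array of probabilities (flattened); zeros are skipped."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log(p)))

p_joint = np.array([[0.2, 0.3],
                    [0.1, 0.4]])   # rows: X, columns: Y (assumed orientation)
p_x = p_joint.sum(axis=1)          # marginal of X
p_y = p_joint.sum(axis=0)          # marginal of Y

h_x_given_y = entropy_nats(p_joint) - entropy_nats(p_y)   # chain rule
h_x = entropy_nats(p_x)

print(round(h_x_given_y, 3), round(h_x, 3))
```

Conditioning never increases entropy, so `h_x_given_y` should not exceed `h_x` for any valid joint table.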

#### e) KL Divergence:

KL divergence is a measure of the difference between two probability distributions. The Kullback-Leibler divergence of q from p, written Dkl(p || q), measures the expected amount of extra information needed to encode samples from p using the optimal encoding for q instead of the optimal encoding for p.




![kl-divergence](https://tungmphung.com/wp-content/ql-cache/quicklatex.com-1b9ee98ad72b18fc48359a05edec4e0e_l3.svg)

In [7]:
def kl_divergence(p, q):
    # D_KL(p || q) = H(p, q) - H(p), reusing crossEntropy and entropy from parts a) and b) of Task 2
    return crossEntropy(p, q) - entropy(p)

Example:

Suppose we have two probability distributions, p and q:

p = [0.3, 0.7]
q = [0.4, 0.6]

Then, the KL divergence of q from p (using the natural logarithm) is:

Dkl(p || q) = (0.3log(0.3/0.4)) + (0.7log(0.7/0.6)) = 0.0216

Insight:

In this example, p and q are different, as their KL divergence is greater than zero. The KL divergence is also not symmetric: swapping the arguments gives Dkl(q || p) = (0.4log(0.4/0.3)) + (0.6log(0.6/0.7)) = 0.0226, which differs from Dkl(p || q).
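The asymmetry can be verified directly by evaluating the divergence in both directions. This is a minimal sketch using the sum form Dkl(p || q) = sum over i of p_i log(p_i / q_i), with a hypothetical helper `kl_nats`. It assumes q_i > 0 wherever p_i > 0.

```python
import math

def kl_nats(p, q):
    """D_KL(p || q) in nats; terms with p_i == 0 contribute nothing.
    Assumes q_i > 0 wherever p_i > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.3, 0.7]
q = [0.4, 0.6]

forward = kl_nats(p, q)   # D_KL(p || q)
reverse = kl_nats(q, p)   # D_KL(q || p)
same = kl_nats(p, p)      # divergence of a distribution from itself

print(round(forward, 4), round(reverse, 4), same)
```

The two directions are close here because p and q are similar, but they are not equal, and the divergence of a distribution from itself is zero.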

###### Note: the KL divergence is not a true distance metric, as it is not symmetric and does not satisfy the triangle inequality.
