## Hypothesis/Source of Confusion: 
We claim that mutual information is symmetric. This is evident from the equations and the Venn diagram, but not so evident intuitively.


"The reduction in uncertainty about Y (a random variable) after knowing X (a random variable) is the same as the reduction in uncertainty about X after knowing Y." But let's say XY is a two-letter English word. If I tell you that the second letter is "O", is the reduction in uncertainty about the first letter the same as the reduction in uncertainty about the second letter when I tell you that the first letter is "N"? Intuitively, the reduction in the second case should be larger, since far fewer two-letter words start with "N" than end with "O". So how does the symmetry of mutual information hold here?

## Problem statement:
Given a two-letter word, we need to show that knowing the first letter reduces the uncertainty about the second letter by the same amount as knowing the second letter reduces the uncertainty about the first.
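
For reference, here are the standard definitions this statement relies on, written with X as the first letter and Y as the second. This is only a restatement of the textbook identities, not something derived from the word data:

$$
H(Y \mid X) = \sum_{x} p(x)\, H(Y \mid X = x), \qquad
I(X;Y) = H(Y) - H(Y \mid X) = H(X) - H(X \mid Y) = H(X) + H(Y) - H(X,Y)
$$

The conditional entropy is already an average over all values of X, which is where the "on average" in the rest of these notes comes from.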

## Method:
- Fetched all meaningful two-letter English words.
- Created a joint probability distribution, assuming all words are equally likely (first letter as random variable X, second letter as random variable Y).
- Calculated the marginal entropies, joint entropy, conditional entropies, and mutual information.
- Confirmed our initial intuition that the two letters aren't independent (as a side result!).
- Also realised that "uncertainty assuming independence" minus "actual uncertainty" equals the mutual information. In hindsight, very obvious from the Venn diagram, but still :P
- Observed that uncertainty reduces with evidence "on average" (as a side result!).
- Implementing this made it clear that I should start thinking in terms of probabilities and expectations.
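
The point about "uncertainty assuming independence" minus "actual uncertainty" can be sketched on a tiny example. The 2x2 `joint` matrix below is a made-up distribution (an assumption for illustration, not the word data), and `entropy` is a hypothetical helper. The sketch compares the entropy of the product of the marginals with the entropy of the actual joint, and checks that the gap matches the KL divergence between the two, which is another standard way of writing mutual information.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits; zero-probability cells are skipped."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

# Hypothetical joint distribution p(x, y): rows index X, columns index Y.
joint = np.array([[0.25, 0.25],
                  [0.00, 0.50]])

P_X = joint.sum(axis=1)  # marginal of X
P_Y = joint.sum(axis=0)  # marginal of Y

# What the joint would look like if X and Y were independent.
independent = np.outer(P_X, P_Y)

H_independent = entropy(independent)  # equals H(X) + H(Y)
H_actual = entropy(joint)             # H(X, Y)
gap = H_independent - H_actual        # mutual information I(X;Y)

# Same quantity as a KL divergence between the joint and the product of marginals.
mask = joint > 0
kl = float(np.sum(joint[mask] * np.log2(joint[mask] / independent[mask])))

print(f"H(X) + H(Y): {H_independent:.4f}")
print(f"H(X, Y):     {H_actual:.4f}")
print(f"gap: {gap:.4f}, KL(joint || independent): {kl:.4f}")
```

A gap of zero would mean the letters are independent; any positive gap is exactly the information one letter carries about the other.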

## Learning:
Learnt the hard way that when we say "knowing", we mean knowing the variable in expectation over its whole probability distribution, not one individual value sampled from that distribution. "Conditioning reduces entropy" and "mutual information is symmetric" are statements that hold true "on average".

Thus, it's not about the uncertainty in the first letter after we learn that the second letter is "O", but about the expected uncertainty in the first letter over the distribution of the second letter. This means we average the individual conditional entropies of the first letter (one for each value of the second letter, weighted by that value's probability) and then subtract this average from the original uncertainty in the first letter.

Even if an individual case such as "N" as the first letter gives the largest reduction (H(Y|X="N") = 0), this is just one term. You need to calculate the average of such individual conditional entropies (weighted by the probability of each first letter) and then subtract it from the original uncertainty in the second letter.

TL;DR: I finally realised that I need to subtract the average across the entire probability distribution, weighted by the probability of each value (so this accounts not just for "N" but for every element "A" to "Z" of the set of X).
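
To make the "one value vs. the average" distinction concrete, here is a minimal sketch on a hypothetical eight-word toy vocabulary (`toy_words` is made up for illustration and is not the real word list; `entropy` is a helper defined here). In this toy set only one word starts with "n", while four words end with "o", which mirrors the confusion described above.

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits; zero-probability cells are skipped."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

# Hypothetical toy vocabulary, all words equally likely.
toy_words = ["no", "so", "to", "go", "an", "at", "as", "am"]
letters = sorted({ch for w in toy_words for ch in w})
index = {ch: i for i, ch in enumerate(letters)}

joint = np.zeros((len(letters), len(letters)))
for w in toy_words:
    joint[index[w[0]], index[w[1]]] += 1
joint /= joint.sum()

P_X = joint.sum(axis=1)  # first-letter marginal
P_Y = joint.sum(axis=0)  # second-letter marginal
H_X, H_Y = entropy(P_X), entropy(P_Y)

# Conditional entropy for each individual observed value.
H_Y_given_x = {ch: entropy(joint[i] / P_X[i]) for ch, i in index.items() if P_X[i] > 0}
H_X_given_y = {ch: entropy(joint[:, j] / P_Y[j]) for ch, j in index.items() if P_Y[j] > 0}

# Probability-weighted averages: these are H(Y|X) and H(X|Y).
H_Y_given_X = sum(P_X[index[ch]] * h for ch, h in H_Y_given_x.items())
H_X_given_Y = sum(P_Y[index[ch]] * h for ch, h in H_X_given_y.items())

print(f"Reduction in Y after seeing X='n': {H_Y - H_Y_given_x['n']:.4f}")
print(f"Reduction in X after seeing Y='o': {H_X - H_X_given_y['o']:.4f}")
print(f"Average reduction in Y given X:    {H_Y - H_Y_given_X:.4f}")
print(f"Average reduction in X given Y:    {H_X - H_X_given_Y:.4f}")
```

In this toy vocabulary, seeing "n" as the first letter removes all 2 bits of uncertainty about the second letter, while seeing "o" as the second letter removes none about the first. The two weighted averages are nevertheless equal (1 bit each), and that average is what mutual information measures.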

In [2]:
import numpy as np
import string
import json
import copy
from spellchecker import SpellChecker
    
    

In [11]:
#surprisingly there were a lot of weird words in the corpus.
def clean_word_list(wrd_lst):
    cp_lst = copy.deepcopy(wrd_lst)
    spell = SpellChecker()
    misspelled = spell.unknown(wrd_lst)
    for wrd in misspelled:
        cp_lst.remove(wrd)
    return cp_lst 


In [12]:
## Download from here: https://github.com/dwyl/english-words/blob/master/words_dictionary.json
filepath = '/home/ayhaos/Downloads/words_dictionary.json'
with open(filepath, 'r') as file:
    data = json.load(file)

word_list = set(data.keys())
wrd_lst = set(w.lower() for w in word_list if len(w) == 2)
final_list = clean_word_list(wrd_lst)
print(len(final_list))


217


In [13]:
freq_matrix = np.zeros((26, 26), dtype=int)  # X and Y both range over the entire English alphabet
legend = {alpha: i for i, alpha in enumerate(string.ascii_lowercase)}
# NOTE: this loop iterates over wrd_lst (the uncleaned set), not final_list from the
# previous cell; the outputs recorded below come from this version.
for wrd in wrd_lst:
    freq_matrix[legend[wrd[0]], legend[wrd[1]]] = 1

# normalize to get word probabilities (we assume all words are equally likely)
prob_matrix = freq_matrix / freq_matrix.sum()

In [14]:
# marginal distributions, marginal entropies, conditional entropies, and mutual information
P_X = np.sum(prob_matrix, axis=1) #marginal of X -- over all possible Y
P_Y = np.sum(prob_matrix, axis=0) #marginal of Y -- over all possible X


def P_Y_given_X(prob_matrix, idx):
    # definition of conditional probability: p(y|x) = p(x,y) / p(x)
    p_x = np.sum(prob_matrix, axis=1)[idx]
    if p_x == 0:
        # X=x_i never occurs: return ones so that log2(1) = 0 and the sum is unaffected
        return np.ones(prob_matrix.shape[1])
    return prob_matrix[idx] / p_x

def P_X_given_Y(prob_matrix, idx):
    # definition of conditional probability: p(x|y) = p(x,y) / p(y)
    p_y = np.sum(prob_matrix, axis=0)[idx]
    if p_y == 0:
        # Y=y_i never occurs: return ones so that log2(1) = 0 and the sum is unaffected
        return np.ones(prob_matrix.shape[0])
    return prob_matrix[:, idx] / p_y


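# H = -sum p*log2(p); zero-probability letters are skipped, since 0*log2(0) would evaluate to nan
H_X = -1 * np.sum(P_X[P_X > 0] * np.log2(P_X[P_X > 0]))  # -sum over x of p(x) log2 p(x)
H_Y = -1 * np.sum(P_Y[P_Y > 0] * np.log2(P_Y[P_Y > 0]))  # -sum over y of p(y) log2 p(y)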

print(f'Marginal Entropy of X: {H_X:.4f}')
print(f'Marginal Entropy of Y: {H_Y:.4f}\n')

##summation over x,y of p(x,y)log(1/p(y|x)) 
H_Y_given_X =  0
for (x,y),val in np.ndenumerate(prob_matrix):
    if val != 0:
        H_Y_given_X += val*np.log2(P_Y_given_X(prob_matrix, x)[y])

##summation over x,y of p(x,y)log(1/p(x|y)) 
H_X_given_Y = 0
for (x,y),val in np.ndenumerate(prob_matrix):
    if val != 0:
        H_X_given_Y += val*np.log2(P_X_given_Y(prob_matrix, y)[x])

H_Y_given_X = -1* H_Y_given_X
H_X_given_Y = -1* H_X_given_Y
# uncertainty never increases with additional evidence (on average!)
print(f'Expected Uncertainity in Y after observing X: {H_Y_given_X:.4f}')
print(f'Expected Uncertainity in X after observing Y: {H_X_given_Y:.4f}\n')

#Joint Entropy
H_XY = H_X + H_Y_given_X 

#Joint Entropy if we assume them independent 
H_XY_wrong = H_X + H_Y

#the difference proves that they aren't independent
print(f'Correct Joint Entropy: {H_XY:.4f}')
print(f'Joint Entropy if we assumed both independent: {H_XY_wrong:.4f}')
print(f'Mutual Information as a delta of joint entropy of independent - dependent: {H_XY_wrong-H_XY:.4f}\n')

#mutual information
I_XY = H_Y - H_Y_given_X
print(f'Mutual Information as expected reduction in uncertainity of Y given we observe X:{I_XY:.4f}')

#let's see if symmetric
I_XY2 = H_X - H_X_given_Y
print(f'Mutual Information as expected reduction in uncertainity of X given we observe Y: {I_XY2:.4f}')


Marginal Entropy of X: 4.6227
Marginal Entropy of Y: 4.5881

Expected Uncertainity in Y after observing X: 4.1154
Expected Uncertainity in X after observing Y: 4.1500

Correct Joint Entropy: 8.7381
Joint Entropy if we assumed both independent: 9.2108
Mutual Information as a delta of joint entropy of independent - dependent: 0.4727

Mutual Information as expected reduction in uncertainity of Y given we observe X:0.4727
Mutual Information as expected reduction in uncertainity of X given we observe Y: 0.4727
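
One quick sanity check on these numbers: when every word is equally likely, the joint entropy of a uniform distribution over N distinct words is simply log2(N). The sketch below uses two hypothetical helpers (`uniform_joint_entropy` and `implied_word_count`, both made up here) to go back and forth between a word count and a joint entropy. It only uses figures printed above and assumes nothing else about the data.

```python
import math

def uniform_joint_entropy(n_words):
    """Joint entropy in bits of a uniform distribution over n_words distinct words."""
    return math.log2(n_words)

def implied_word_count(joint_entropy_bits):
    """Number of equally likely words that would give this joint entropy."""
    return round(2 ** joint_entropy_bits)

# Joint entropy reported above, and the cleaned word count printed earlier.
reported_joint_entropy = 8.7381
cleaned_word_count = 217

print(f"Words implied by the reported joint entropy: {implied_word_count(reported_joint_entropy)}")
print(f"Joint entropy expected for {cleaned_word_count} words: {uniform_joint_entropy(cleaned_word_count):.4f}")
```

The reported joint entropy corresponds to roughly 427 equally likely words rather than 217, which suggests the probability matrix was built from the uncleaned `wrd_lst` instead of `final_list`. The symmetry result does not depend on which list is used.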
