## Exercise 1:
Evaluate the vector P for the text of the French Déclaration des droits de l'homme

In [107]:
import numpy as np
import pandas as pd

In [108]:
# Read the text file (it is Latin-1 encoded); the with-block closes the file afterwards
with open("Declaration1789.txt", 'r', encoding='latin-1') as file:
    data = file.read()
print(data)

[Préambule]

Les représentants du Peuple français, constitués en Assemblée nationale, considérant que l'ignorance, l'oubli ou le mépris des droits de l'homme sont les seules causes des malheurs publics et de la corruption des gouvernements, ont résolu d'exposer, dans une déclaration solennelle, les droits naturels, inaliénables et sacrés de l'homme, afin que cette déclaration, constamment présente à tous les membres du corps social, leur rappelle sans cesse leurs droits et leurs devoirs ; afin que les actes du pouvoir législatif et ceux du pouvoir exécutif, pouvant être à chaque instant comparés avec le but de toute institution politique, en soient plus respectés ; afin que les réclamations des citoyens, fondées désormais sur des principes simples et incontestables, tournent toujours au maintien de la constitution et au bonheur de tous.
En conséquence, l'Assemblée nationale reconnaît et déclare, en présence et sous les auspices de l'Etre suprême, les droits suivants de l'homme et du ci

In [109]:
# Count every character; store its count, probability (filled in later) and matrix index in a dictionary
character_dict = {}
idx = 0
for c in data:
    if c not in character_dict:
        character_dict[c] = {"count": 1, "prob": 0, "idx": idx}
        idx += 1
    else:
        character_dict[c]["count"] += 1

In [110]:
print(f"There are {len(character_dict)} different characters in the file")


There are 57 different characters in the file


In [111]:
# Calculate the probability of each character
len_data = len(data)

chars = []
for key, value in character_dict.items():
    chars.append(key)
    value["prob"] = value["count"] / len_data

In [112]:
character_dict[" "]["prob"]

0.1672690763052209
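
As a cross-check on the loop above, the same probability vector can be sketched with `collections.Counter`. The sample string is a hypothetical stand-in for `data`, and `char_probabilities` is an illustrative helper name that does not appear in the notebook.

```python
from collections import Counter

def char_probabilities(text):
    """Return a dict mapping each character to its relative frequency in text."""
    counts = Counter(text)
    total = len(text)
    return {ch: n / total for ch, n in counts.items()}

# Hypothetical sample text standing in for the contents of Declaration1789.txt
sample = "les droits de l'homme"
probs = char_probabilities(sample)

# A valid probability vector sums to 1 (up to floating-point error)
print(abs(sum(probs.values()) - 1.0) < 1e-12)
# Relative frequency of the space character in the sample (3 spaces out of 21 characters)
print(probs[" "])
```

Checking that the probabilities sum to 1 is a cheap way to catch a wrong denominator.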

## Exercise 2: Calculate the entropy of the file

In [113]:
H = 0
for key, value in character_dict.items():
    H -= value["prob"]*np.log2(value["prob"])

print(f"Entropy: {H}")

Entropy: 4.324709030776801
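
The entropy loop can also be written as a single vectorized NumPy expression. The sketch below assumes the probabilities are available as a plain array; `entropy_bits` and the toy distribution are illustrative and do not come from the notebook.

```python
import numpy as np

def entropy_bits(probs):
    """Shannon entropy in bits of a discrete distribution given as an array of probabilities."""
    p = np.asarray(probs, dtype=float)
    p = p[p > 0]  # zero-probability symbols contribute nothing (0 * log 0 is taken as 0)
    return float(-np.sum(p * np.log2(p)))

# Toy distribution (an assumption for illustration): H = 0.5*1 + 0.25*2 + 0.25*2 = 1.5 bits
toy = [0.5, 0.25, 0.25]
print(entropy_bits(toy))
```

For comparison, a uniform distribution over the 57 characters would reach the maximum of $\log_2 57 \approx 5.83$ bits, so the value measured for the document sits below that bound, as expected.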


## Exercise 3: 
Assume that the declaration was generated by a first-order Markov source. Evaluate the transition matrix P

We need to calculate $P(X_n=a \mid X_{n-1}=b)$ for every pair of characters a and b that appear in the document.

By the definition of conditional probability, we have

$P(X_n=a\mid X_{n-1}=b) = \frac{P(X_n=a,X_{n-1}=b)}{P(X_{n-1}=b)}$

Equivalently, we can estimate the probability from counts as

$P(X_n=a\mid X_{n-1}=b) = \frac{\text{Number of pairs } ba}{\text{Number of } b}$

The last character of the document has no character following it, so for that character we need to use "Number of b" - 1 as the denominator.
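
The counting rule can be sketched on a toy string before applying it to the whole document. The helper `transition_probs` and the text `"abacab"` are assumptions made for illustration. Counting predecessors over `text[:-1]` applies the "minus one" correction for the last character automatically.

```python
from collections import Counter

def transition_probs(text):
    """Estimate P(X_n = a | X_{n-1} = b) from consecutive character pairs in text."""
    pair_counts = Counter(zip(text[:-1], text[1:]))  # keys are (b, a): previous, current
    prev_counts = Counter(text[:-1])                 # how often each b has a successor
    return {(b, a): n / prev_counts[b] for (b, a), n in pair_counts.items()}

# Toy text (an assumption for illustration); its last character "b" also occurs earlier
probs = transition_probs("abacab")
print(probs[("a", "b")])  # P(b | a) = 2/3: "a" has three successors, two of them "b"
print(probs[("b", "a")])  # P(a | b) = 1/1: the final "b" has no successor and is not counted
```

Here "b" occurs twice but is followed by a character only once, so the denominator for "b" is 2 - 1 = 1.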

In [114]:
# Count each pair of consecutive characters in the document, for example "ab", "bc"
pair_dict = {}
for i in range(len(data)-1):
    tmp = data[i:i+2]
    if tmp not in pair_dict:
        pair_dict[tmp] = 1
    else:
        pair_dict[tmp] += 1

In [115]:
# for key, value in pair_dict.items():
#     pair_dict[key] = value/(len(data)-1)

In [116]:
print(f"There are {len(pair_dict)} pairs in the document")

There are 388 pairs in the document


In [117]:
print(pair_dict)

{'[P': 1, 'Pr': 1, 'ré': 23, 'éa': 2, 'am': 3, 'mb': 5, 'bu': 6, 'ul': 12, 'le': 91, 'e]': 1, ']\n': 1, '\n\n': 1, '\nL': 1, 'Le': 5, 'es': 101, 's ': 140, ' r': 19, 're': 61, 'ep': 3, 'pr': 27, 'és': 19, 'se': 24, 'en': 72, 'nt': 69, 'ta': 20, 'an': 30, 'ts': 18, ' d': 118, 'du': 9, 'u ': 21, ' P': 2, 'Pe': 1, 'eu': 33, 'up': 5, 'pl': 8, 'e ': 192, ' f': 12, 'fr': 1, 'ra': 17, 'nç': 1, 'ça': 1, 'ai': 17, 'is': 23, 's,': 18, ', ': 66, ' c': 56, 'co': 31, 'on': 69, 'ns': 37, 'st': 31, 'ti': 64, 'it': 59, 'tu': 11, 'ué': 4, ' e': 58, 'n ': 42, ' A': 18, 'As': 2, 'ss': 21, 'em': 26, 'bl': 26, 'lé': 6, 'ée': 15, ' n': 43, 'na': 12, 'at': 20, 'io': 29, 'al': 13, 'e,': 19, 'si': 17, 'id': 5, 'dé': 18, 'ér': 3, 't ': 111, ' q': 25, 'qu': 39, 'ue': 24, ' l': 111, "l'": 24, "'i": 4, 'ig': 5, 'gn': 2, 'no': 3, 'or': 17, 'nc': 14, 'ce': 33, "'o": 4, 'ou': 52, 'ub': 10, 'li': 29, 'i ': 19, ' o': 12, ' m': 10, 'mé': 3, 'ép': 6, 'ri': 18, 'de': 60, 'dr': 21, 'ro': 22, 'oi': 46, "'h": 6, 'ho': 10, 'o

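There are 57 different characters in the document, so the matrix P has shape 57x57. The entry P[a][b] (row a, column b) stores $P(X_n=a \mid X_{n-1}=b)$, so each column corresponds to a previous character.

If the pair "xy" does not appear in pair_dict, then P[y][x] = 0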

In [118]:
# P[a][b] holds P(X_n = a | X_{n-1} = b): rows index the current character, columns the previous one
P = np.zeros((len(character_dict), len(character_dict)))
last_char = data[-1]
print(f"The last character is: {last_char}")

for key, value in pair_dict.items():
    b = key[0]  # previous character
    a = key[1]  # current character
    # The final character of the document has no successor, so it starts one fewer pair
    n_b = character_dict[b]["count"] - (1 if b == last_char else 0)
    P[character_dict[a]["idx"], character_dict[b]["idx"]] = value / n_b


The last character is: .


In [119]:
print(P)

[[0.         0.         0.         ... 0.         0.         0.        ]
 [1.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.33333333 0.01171875 ... 0.         0.         0.        ]
 ...
 [0.         0.         0.         ... 0.         0.         0.        ]
 [0.         0.         0.         ... 0.         0.11111111 0.        ]
 [0.         0.         0.         ... 0.         0.11111111 0.        ]]


In [120]:
P_df = pd.DataFrame(P, index=chars, columns=chars)

In [121]:
P_df.shape

(57, 57)

In [122]:
print(P_df.iloc[0:5,0:5])

     [         P         r         é         a
[  0.0  0.000000  0.000000  0.000000  0.000000
P  1.0  0.000000  0.000000  0.000000  0.000000
r  0.0  0.333333  0.011719  0.022901  0.110599
é  0.0  0.000000  0.089844  0.000000  0.000000
a  0.0  0.000000  0.066406  0.015267  0.000000


In [123]:
print(character_dict["["])
print(pair_dict["[P"])

{'count': 1, 'prob': 0.00020080321285140563, 'idx': 0}
1


In [130]:
# Sanity check: count the entries greater than 1 (a valid probability matrix has none)
(P_df > 1).sum().sum()

0
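
A stricter sanity check than counting entries above 1 is to verify that every column sums to 1, since each column is a conditional distribution over the next character. The sketch below is self-contained and rebuilds a small matrix from a toy string. `transition_matrix` and `"abacab"` are illustrative assumptions, and the characters are indexed in sorted order rather than in the notebook's order of first appearance.

```python
import numpy as np

def transition_matrix(text):
    """Build P with P[a, b] = P(X_n = a | X_{n-1} = b), the notebook's row/column layout."""
    chars = sorted(set(text))
    idx = {ch: i for i, ch in enumerate(chars)}
    counts = np.zeros((len(chars), len(chars)))
    for b, a in zip(text[:-1], text[1:]):
        counts[idx[a], idx[b]] += 1
    col_totals = counts.sum(axis=0)  # number of times each b is followed by something
    # Divide each column by its total; columns with no successor stay at zero
    P = np.divide(counts, col_totals, out=np.zeros_like(counts), where=col_totals > 0)
    return P, chars

P_toy, chars_toy = transition_matrix("abacab")
# Every column of a valid transition matrix in this layout sums to 1
print(np.allclose(P_toy.sum(axis=0), 1.0))
```

On the notebook's own matrix, the equivalent check would be `np.allclose(P.sum(axis=0), 1.0)`. A column can legitimately sum to 0 only if its character occurs once, at the very end of the document.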