In this homework, we model English words as a Markov chain of letters, generate random words from the chain, and estimate the missing letters of a given word. The model is a rather crude approximation, so the results do not always look reasonable given the structure of English words.

The following code snippet loads the transition probability matrix. Row $i$, column $j$ of the matrix holds $p(x_t = j | x_{t-1} = i)$.

In [2]:
import csv
from IPython.display import display, Latex

alphabet = [chr(i+ord('a')) for i in range(26)]
alphabet.append('.')
letter2idx = {c:i for i,c in enumerate(alphabet)}

T = []
with open('transitions.csv') as csvfile:
    reader = csv.reader(csvfile, delimiter=',')
    for row in reader:
        T.append(row)

print('Example')
## p(x_t = 'u' | x_{t-1} = 'q')
display(Latex(r"$p(x_t = \text{'u'} | x_{t-1} = \text{'q'})$"))
print(T[letter2idx['q']][letter2idx['u']])
display(Latex(r"$p(x_t | x_{t-1} = \text{'a'})$"))
for c,p in zip(alphabet,T[letter2idx['a']]):
    print(c,p)    

Example


<IPython.core.display.Latex object>

0.9949749


<IPython.core.display.Latex object>

a 0.0002835
b 0.0228302
c 0.0369041
d 0.0426290
e 0.0012216
f 0.0075739
g 0.0171385
h 0.0014659
i 0.0372661
j 0.0002353
k 0.0110124
l 0.0778259
m 0.0260757
n 0.2145354
o 0.0005459
p 0.0195213
q 0.0001749
r 0.1104770
s 0.0934290
t 0.1317960
u 0.0098029
v 0.0306574
w 0.0088799
x 0.0009562
y 0.0233701
z 0.0018701
. 0.0715219


## TASK 1

The following code generates random English words of a specified length using the probability matrix. For efficiency we exploit the Markov property: since each letter depends only on the previous letter, the letters can be generated one at a time, starting from the beginning of the word. Formally,
$$x_i\sim P(x_i|x_{i-1})$$
where $x_1\sim P(x_1|\text{'.'})$, i.e. the first letter is drawn as if it followed a word boundary.

In [105]:
import numpy as np

def rand_word(N):
    """Sample a string of N letters from the first-order Markov chain."""
    # Assume that every word starts after a '.' (the word boundary).
    letters = ['.']
    # Generate the letters one by one, each conditioned on the previous one.
    for i in range(N):
        prev_id = letter2idx[letters[i]]
        # Row prev_id of T holds p(x_t | x_{t-1} = previous letter).
        probs = np.array([float(p) for p in T[prev_id]])
        # Renormalize to absorb rounding error in the CSV values.
        probs /= probs.sum()
        letters.append(np.random.choice(alphabet, p=probs))
    # Drop the artificial leading '.'.
    return ''.join(letters[1:])

N = 15
for i in range(5, N):
    print("Sample :", i-4, "   ", rand_word(i))

Sample : 1     ppr.o
Sample : 2     strl.t
Sample : 3     ro.fry.
Sample : 4     jucer.ta
Sample : 5     aayd.f.pp
Sample : 6     wine.bre.d
Sample : 7     t....be.fom
Sample : 8     te.to.an.o.w
Sample : 9     he.sed.intha.
Sample : 10     serne.ther.thi


## TASK 2

The task is to fill in the missing letters of given English words. Without implementing the more general sum-product algorithm on a factor graph, we can estimate the words efficiently by using the first-order Markov property. This implementation is in fact a special case of the general sum-product algorithm.

The first thing to notice is that separate chunks of missing letters are independent of each other given the known letters. For instance, the word $\text{'th__br__n.f_x.'}$ can be divided into the set of sub-words $S=\{\text{'h__b'},\text{'r__n'},\text{'f_x'}\}$, and the missing letters of each sub-word can be filled in separately. This follows from the fact that whenever a letter is given, the chain is broken at that point.
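
The splitting step can be sketched as follows. This is a minimal illustration rather than the code used below: the helper name `missing_chunks` and its `blanks` parameter are assumptions made for this example. It returns the first and last index of every run of consecutive missing letters.

```python
def missing_chunks(word, blanks="_?"):
    """Return (first, last) index pairs of runs of consecutive missing letters.

    `missing_chunks` and `blanks` are hypothetical names for this sketch.
    """
    chunks = []
    start = None  # index where the current run of blanks began, if any
    for i, ch in enumerate(word):
        if ch in blanks:
            if start is None:
                start = i
        elif start is not None:
            # A known letter closes the current run.
            chunks.append((start, i - 1))
            start = None
    if start is not None:
        # The word ends inside a run of blanks.
        chunks.append((start, len(word) - 1))
    return chunks

print(missing_chunks("th__br__n.f_x."))  # [(2, 3), (6, 7), (11, 11)]
```

Each pair, together with the known letters on either side of it, defines one independent sub-problem.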

Estimation is performed recursively, starting from the last element of the missing chunk.

Suppose we have the sequence $x_2=a,x_3,x_4,x_5,x_6=t$, where $x_3,x_4,x_5$ are missing.

To estimate $x_5$ we marginalize the joint distribution by summing over the other missing letters. Formally,

$$P(x_5|x_2=a,x_6=t)= \sum_{x_3,x_4} P(x_3,x_4,x_5|x_2=a,x_6=t)$$
$$P(x_5|x_2=a,x_6=t)\propto \sum_{x_3,x_4} P(x_3|x_2=a)P(x_4|x_3)P(x_5|x_4)P(x_6=t|x_5)$$

Reorganizing terms,

$$P(x_5|x_2=a,x_6=t)\propto \sum_{x_3,x_4} P(x_6=t|x_5)P(x_5|x_4)P(x_4|x_3)P(x_3|x_2=a)$$

Pushing each sum as far to the right as possible, this is mathematically equivalent to

$$P(x_5|x_2=a,x_6=t)\propto P(x_6=t|x_5)\sum_{x_4}P(x_5|x_4)\sum_{x_3}P(x_4|x_3)P(x_3|x_2=a)$$

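Let $\vec{v}^T$ be the row vector with entries $p(x_3|x_2=a)$ (the row of $T$ for 'a') and $\vec{u}$ the vector with entries $p(x_6=t|x_5)$ (the column of $T$ for 't').

Notice that the nested sums correspond to matrix products: each sum propagates the distribution one step forward through $T$.

Mathematically,

$$P(x_5|x_2=a,x_6=t)\propto (\vec{v}^T T^2)\odot \vec{u}^T$$

where $\odot$ denotes the element-wise product and $T$ is the transition probability matrix with $T_{ij}=p(x_t=j|x_{t-1}=i)$. (Due to the Markov property, letters before the most recent given letter become unnecessary as we move forward.)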
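
As a sanity check, the nested sums can be compared numerically with the matrix-product form: propagate the row $p(x_3|x_2=a)$ through $T$ twice, then multiply element-wise by the column $p(x_6=t|x_5)$. The sketch below uses a made-up 3-letter transition matrix (not the one loaded from `transitions.csv`) and plain Python lists so that it runs on its own; all names in it are assumptions for illustration.

```python
from itertools import product

# Hypothetical 3-letter chain, for illustration only.
# T[i][j] = P(x_t = j | x_{t-1} = i); every row sums to 1.
T = [[0.1, 0.6, 0.3],
     [0.5, 0.2, 0.3],
     [0.3, 0.3, 0.4]]
K = len(T)
a, t = 0, 2  # indices of the known letters x_2 and x_6

def brute_force():
    """Sum the joint over x_3 and x_4 explicitly for every value of x_5."""
    w = [0.0] * K
    for x3, x4, x5 in product(range(K), repeat=3):
        w[x5] += T[a][x3] * T[x3][x4] * T[x4][x5] * T[x5][t]
    return w

def vec_mat(v, M):
    """Row vector times matrix."""
    return [sum(v[i] * M[i][j] for i in range(K)) for j in range(K)]

def matrix_form():
    """Row a of T, propagated twice through T, times column t element-wise."""
    forward = vec_mat(vec_mat(T[a], T), T)     # p(x_5 | x_2 = a)
    evidence = [T[j][t] for j in range(K)]     # p(x_6 = t | x_5 = j)
    return [f * e for f, e in zip(forward, evidence)]

def normalize(w):
    s = sum(w)
    return [x / s for x in w]

posterior_brute = normalize(brute_force())
posterior_matrix = normalize(matrix_form())
print(posterior_brute)
print(posterior_matrix)
```

The two printed distributions should agree up to floating-point error, while the matrix form avoids enumerating every combination of the missing letters.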



In [89]:
from numpy import linalg as LA
test_strings = ['th__br__n.f_x.', '_u_st__n_.to_be._nsw_r__','i__at_._a_h_n_._e_r_i_g','q___t.___z._____t.__.___.__.']
#Convert T to a float array (np.float was removed from recent NumPy releases).
T=np.array(T).astype(float)
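#Sample the letters in the interval [low, high], from last to first.
def estimate_interval(low,high,previ,nexti,charList):
    length=high-low+1
    if length==0:
        return charList

    #p(x_low | previous letter): row previ of T.
    row=T[previ]
    if high==len(charList)-1:
        #The chunk ends the string, so no following letter constrains it.
        column=1
    else:
        #p(next letter | x_high): column nexti of T.
        column=T[:,nexti]

    #Propagate forward to get p(x_high | previous letter).
    v=row
    if length>1:
        v=np.dot(v,LA.matrix_power(T,length-1))

    #Combine with the evidence from the next letter, normalize and sample.
    w=v*column
    final=np.random.choice(alphabet,p=w/np.sum(w))
    charList[high]=final
    if length==1:
        return charList
    #The sampled letter becomes the "next letter" for the rest of the chunk.
    return estimate_interval(low,high-1,previ,letter2idx[final],charList)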
#Estimate the missing letters in a given string.
def fill_missing_letters(string):
    #Positions of the unknown letters ('?' and '_' both mark a gap).
    unknown=[i for i,ch in enumerate(string) if ch in '?_']

    #Group consecutive unknown positions into chunks.
    sum_terms=[]
    for pos in unknown:
        if sum_terms and pos-sum_terms[-1][-1]==1:
            sum_terms[-1].append(pos)
        else:
            sum_terms.append([pos])

    result=list(string)
    for chunk in sum_terms:
        low,up=chunk[0],chunk[-1]
        #The letter before the chunk is always known,
        #because the caller prepends '.' to the word.
        prevId=letter2idx[string[low-1]]
        #nextId=-1 marks a chunk that runs to the end of the string.
        nextId=letter2idx[string[up+1]] if up!=len(string)-1 else -1
        result=estimate_interval(low,up,prevId,nextId,result)

    return result

for i in test_strings:
    print("Word: ",i)
    for j in range (0,10):
        result='.'+i
        result=fill_missing_letters(result)
        print("Sample ",j+1,": ",''.join(result[1:len(result)]))




Word:  th__br__n.f_x.
Sample  1 :  the.bryon.fex.
Sample  2 :  thi.brirn.fix.
Sample  3 :  thanbrman.fex.
Sample  4 :  the.brean.fex.
Sample  5 :  theabrven.fex.
Sample  6 :  thoubry.n.fox.
Sample  7 :  the.broun.fox.
Sample  8 :  tho.br.an.fox.
Sample  9 :  thd.brean.fex.
Sample  10 :  the.brinn.f.x.
Word:  _u_st__n_.to_be._nsw_r__
Sample  1 :  qutsthono.to.be.answuren
Sample  2 :  puesthane.tombe.answarie
Sample  3 :  tuestren..toube.answorye
Sample  4 :  culste.nd.toube.answernl
Sample  5 :  ougst.ind.tombe.answarrr
Sample  6 :  cuosthang.to.be.onswar.p
Sample  7 :  mutst.ony.to.be.onsware.
Sample  8 :  jussthing.to.be.answor.d
Sample  9 :  qutstheng.tombe..nswirit
Sample  10 :  ourste.nd.tombe.answor.l
Word:  i__at_._a_h_n_._e_r_i_g
Sample  1 :  in.ate.pathind.peer.ing
Sample  2 :  ie.ato.lachint.wetr.ing
Sample  3 :  ingath.fatheng..e.rfing
Sample  4 :  ispate..athand.re.rming
Sample  5 :  it.ath.bathano.bepr.ing
Sample  6 :  ifeatw.fanhend.beerming
Sample  7 :  ithate.waghang.pea

## TASK 3

We can use the same algorithm with a small modification: for each missing element, we choose the letter with the highest probability instead of sampling one. Each chunk is filled starting from its last missing element, and this choice of order can affect the results.
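
To see why the order matters, the sketch below compares this greedy last-to-first choice with an exhaustive search over all fills of a two-letter gap. It uses a made-up 3-letter transition matrix, not the English one, and every name in it is an assumption for illustration.

```python
from itertools import product

# Hypothetical 3-letter chain, for illustration only.
# T[i][j] = P(x_t = j | x_{t-1} = i).
T = [[0.1, 0.6, 0.3],
     [0.5, 0.2, 0.3],
     [0.3, 0.3, 0.4]]
K = len(T)
prev, nxt = 0, 2  # known letters on either side of a two-letter gap

def joint(x2, x3):
    """Unnormalized joint probability of filling the gap with (x2, x3)."""
    return T[prev][x2] * T[x2][x3] * T[x3][nxt]

def greedy_fill():
    """Fix the last missing letter first, then the one before it."""
    # Marginal weight of x3: sum over x2, times the evidence from nxt.
    w3 = [sum(T[prev][x2] * T[x2][x3] for x2 in range(K)) * T[x3][nxt]
          for x3 in range(K)]
    x3 = max(range(K), key=lambda j: w3[j])
    # Given the chosen x3, pick the best x2.
    x2 = max(range(K), key=lambda j: T[prev][j] * T[j][x3])
    return x2, x3

def exhaustive_fill():
    """Try every combination and keep the jointly most probable one."""
    return max(product(range(K), repeat=2), key=lambda xs: joint(*xs))

g = greedy_fill()
e = exhaustive_fill()
print("greedy:    ", g, joint(*g))
print("exhaustive:", e, joint(*e))
```

For this toy matrix the two fills differ: the greedy pass maximizes the marginal of the last letter, which need not belong to the jointly most probable sequence.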

In [138]:
from numpy import linalg as LA
test_strings = ['th__br__n.f_x.', '_u_st__n_.to_be._nsw_r__','i__at_._a_h_n_._e_r_i_g','q___t.___z._____t.__.___.__.']
#Convert T to a float array (np.float was removed from recent NumPy releases).
T=np.array(T).astype(float)

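#Fill the interval [low, high] greedily, from last to first, and return
#the filled list together with the probability of the chosen letters.
def estimate_interval(low,high,previ,nexti,charList):
    length=high-low+1
    if length==0:
        return charList,1

    #p(x_low | previous letter): row previ of T.
    row=T[previ]
    if high==len(charList)-1:
        #The chunk ends the string, so no following letter constrains it.
        column=1
    else:
        #p(next letter | x_high): column nexti of T.
        column=T[:,nexti]

    #Propagate forward to get p(x_high | previous letter).
    v=row
    if length>1:
        v=np.dot(v,LA.matrix_power(T,length-1))

    #Posterior over x_high; keep its most probable letter.
    w=v*column
    w=w/np.sum(w)
    final=int(np.argmax(w))
    charList[high]=alphabet[final]
    if length==1:
        return charList,w[final]
    #The chosen letter becomes the "next letter" for the rest of the chunk.
    tempList,prob=estimate_interval(low,high-1,previ,final,charList)
    return tempList,prob*w[final]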
 

def fill_missing_letters(string):    
    N=len(string)
    unknown=[]
    sum_terms=[]
    for ch in range (0,N):
        current=string[ch]
        if current=='?' or current=='_':
            unknown.append(ch)


    for i in range(0,len(unknown)):
        temp=[]
        if i==0:
            temp.append(unknown[i])
            sum_terms.append(temp)
        elif unknown[i]-unknown[i-1]==1:
            sum_terms[len(sum_terms)-1].append(unknown[i])
        else :
            temp.append(unknown[i])
            sum_terms.append(temp)


        
    result=list(string)
    #Probability of the whole fill-in: the chunks are independent,
    #so the per-chunk probabilities multiply.
    total_prob=1
    for i in range(0,len(sum_terms)):
        low=sum_terms[i][0]
        up=sum_terms[i][-1]
        prevId=letter2idx[string[low-1]]
        if(up!=len(string)-1):
            nextId=letter2idx[string[up+1]]
        else :
            nextId=-1

        result,prob=estimate_interval(low,up,prevId,nextId,result)
        total_prob*=prob

    return result,total_prob

for i in test_strings:
    result='.'+i
    result,prob=fill_missing_letters(result)
    #Print the filled-in word and the base-2 log of its probability.
    print(''.join(result[1:]),np.log2(prob))

the.br.an.fex. -4.43532766488
oursthand.to.be.answere. -15.7481002491
in.ath.wathend.he.r.ing -16.7873293326
qus.t.herz.athe.t.he.the.he. -31.1291155683


In reality, a letter depends not only on the single previous letter but on two or more preceding letters. To take this dependency into account, we could use a higher-order Markov(n) model.
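
As a rough sketch of what a Markov(2) model would need, the code below counts how often each letter follows each pair of preceding letters and normalizes those counts into conditional probabilities. The function names and the tiny word list are made up for illustration; a real model would be estimated from a large corpus.

```python
from collections import Counter, defaultdict

def second_order_counts(words):
    """Count how often each letter follows each pair of preceding letters."""
    counts = defaultdict(Counter)
    for word in words:
        # '.' marks word boundaries, as in the first-order model above.
        padded = ".." + word + "."
        for i in range(2, len(padded)):
            context = (padded[i - 2], padded[i - 1])
            counts[context][padded[i]] += 1
    return counts

def conditional(counts, context):
    """Normalize the counts for one context into p(x_t | x_{t-2}, x_{t-1})."""
    total = sum(counts[context].values())
    return {letter: n / total for letter, n in counts[context].items()}

# Tiny made-up word list, only to make the example runnable.
counts = second_order_counts(["the", "then", "this"])
print(conditional(counts, ("t", "h")))
```

With 27 symbols, the second-order table has $27^2$ contexts instead of 27, so it needs considerably more data to estimate reliably.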