Name: Kemal Berk Kocabagli

I hereby declare that I observed the honour code of the university when preparing the homework.

In [1]:
from IPython.display import Image
from IPython.display import display, Latex

## Programming Homework 3

In this exercise we model a string of text using a Markov(1) model. For simplicity we only consider the letters 'a-z'. Capital letters 'A-Z' are mapped to their lowercase counterparts. All remaining characters (symbols, digits, and spaces) are denoted by '.'.
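
The mapping described above can be sketched as follows. This is only an illustration: the helper name `normalize` is hypothetical and is not part of the homework code.

```python
import string

def normalize(text):
    """Map raw text onto the 27-symbol alphabet: 'a'-'z' plus '.' for everything else."""
    allowed = set(string.ascii_lowercase)
    # Lowercase first, then replace every character outside 'a'-'z' with '.'
    return ''.join(c if c in allowed else '.' for c in text.lower())

print(normalize("The quick Brown fox, 42!"))
```

Every character that is not a lowercase ASCII letter after lowercasing, including spaces, digits, and punctuation, becomes a single '.'.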


We have a probability table $T$ where $T_{i,j} = p(x_t = j | x_{t-1} = i)$ is the transition model of letters in English text for $t=1,2 \dots N$. Assume that the initial letter in a string is always a space, denoted as $x_0 = \text{'.'}$. Such a model, where the probability table is the same at every position $t$, is sometimes called a stationary model.

1. For a given $N$, write a program to sample random strings with letters $x_1, x_2, \dots, x_N$ from $p(x_{1:N}|x_0)$
1. Now suppose you are given strings with missing letters, where each missing letter is denoted by a question mark (or underscore, as below). Implement a method, that samples missing letters conditioned on observed ones, i.e., samples from $p(x_{-\alpha}|x_{\alpha})$ where $\alpha$ denotes indices of observed letters. For example, if the input is 't??.', we have $N=4$ and
$x_1 = \text{'t'}$ and $x_4 = \text{'.'}$, $\alpha=\{1,4\}$ and $-\alpha=\{2,3\}$. Your program may possibly generate the strings 'the.', 'twi.', 'tee.', etc. Hint: make sure to make use of all the data given and to sample from the correct distribution. Implement the method and print the results for the test strings below. 
1. Describe a method for filling in the gaps by estimating the most likely letter for each position. Hint: you need to compute
$$
x_{-\alpha}^* = \arg\max_{x_{-\alpha}} p(x_{-\alpha}|x_{\alpha})
$$
Implement the method and print the results for the following test strings along with the log-probability  $\log p(x_{-\alpha}^*,x_{\alpha})$.
1. Discuss how you can improve the model to get better estimations.

In [2]:
test_strings = ['th__br__n.f_x.', '_u_st__n_.to_be._nsw_r__','i__at_._a_h_n_._e_r_i_g','q___t.___z._____t.__.___.__.']

Hint: The code below loads a table of transition probabilities for English text.

In [3]:
import csv # transitions.csv holds a 27x27 table; stationary because the same table is used at every position
import numpy as np
from itertools import product # for cartesian product
import operator # for max index


from IPython.display import display, Latex

alphabet = [chr(i+ord('a')) for i in range(26)]
alphabet.append('.')
letter2idx = {c:i for i,c in enumerate(alphabet)}

T = []
with open('transitions.csv') as csvfile:
    reader = csv.reader(csvfile, delimiter=',')
    for row in reader:
        T.append(row) # add probabilities to T row by row

print('Example')
## p(x_t = 'u' | x_{t-1} = 'q')
display(Latex(r"$p(x_t = \text{'u'} | x_{t-1} = \text{'q'})$"))
print(T[letter2idx['q']][letter2idx['u']])
display(Latex(r"$p(x_t | x_{t-1} = \text{'a'})$"))
for c,p in zip(alphabet,T[letter2idx['a']]):
    print(c,p)
    


Example


<IPython.core.display.Latex object>

0.9949749


<IPython.core.display.Latex object>

a 0.0002835
b 0.0228302
c 0.0369041
d 0.0426290
e 0.0012216
f 0.0075739
g 0.0171385
h 0.0014659
i 0.0372661
j 0.0002353
k 0.0110124
l 0.0778259
m 0.0260757
n 0.2145354
o 0.0005459
p 0.0195213
q 0.0001749
r 0.1104770
s 0.0934290
t 0.1317960
u 0.0098029
v 0.0306574
w 0.0088799
x 0.0009562
y 0.0233701
z 0.0018701
. 0.0715219


## Part 1 

We are asked to sample the letters $x_1, \dots, x_N$ from

\begin{equation}
P = p(x_1, x_2, \dots, x_N|x_0=\text{'.'}) = \dfrac{p(x_0=\text{'.'}, x_1, x_2, \dots, x_N)}{p(x_0=\text{'.'})} = p(x_1|x_0=\text{'.'}) \cdot p(x_2|x_1) \cdots p(x_N|x_{N-1})
\end{equation}

Since our model is Markov(1),
only the immediately preceding letter affects the probability distribution of the current letter.

The distribution for letter $x_i$, given the previously sampled letter $\hat{x}_{i-1}$, is
\begin{equation}
P_{x_i} = p(x_i|x_{i-1}=\hat{x}_{i-1}) 
\end{equation}

Our strategy is to sample the letters one at a time: the row of the transition matrix indexed by the current letter is the distribution of the next letter. 
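
As a minimal sketch of this strategy, the snippet below samples from a toy 3-symbol chain. The names `toy_alphabet`, `toy_T`, and `sample_chain` are hypothetical, and the transition values are made up for illustration; the real table is the 27x27 one loaded from `transitions.csv`.

```python
import numpy as np

# Toy 3-symbol alphabet and transition table (made-up values; each row sums to 1).
toy_alphabet = ['a', 'b', '.']
toy_T = np.array([[0.1, 0.6, 0.3],
                  [0.5, 0.1, 0.4],
                  [0.7, 0.3, 0.0]])

def sample_chain(T, alphabet, N, rng, start='.'):
    """Draw x_1..x_N one letter at a time; each draw uses the row of T for the previous letter."""
    idx = alphabet.index(start)
    out = []
    for _ in range(N):
        idx = rng.choice(len(alphabet), p=T[idx])  # next letter ~ p(. | current letter)
        out.append(alphabet[idx])
    return ''.join(out)

rng = np.random.default_rng(0)
print(sample_chain(toy_T, toy_alphabet, 10, rng))
```

Because the toy table assigns probability 0 to the transition from '.' to '.', no sample from it ever contains two consecutive dots, which is a quick sanity check that the rows are being used as intended.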


In [7]:
import csv # transitions.csv holds a 27x27 table; stationary because the same table is used at every position
import numpy as np

from IPython.display import display, Latex

alphabet = [chr(i+ord('a')) for i in range(26)]
alphabet.append('.')
letter2idx = {c:i for i,c in enumerate(alphabet)}

T = [] # transition matrix
with open('transitions.csv') as csvfile:
    reader = csv.reader(csvfile, delimiter=',')
    for row in reader:
        row = [float(e) for e in row] # convert to float
        T.append(row) # add probabilities to T row by row


# algorithm for weighted choice: inverse-CDF sampling over the cumulative weights
def weighted_choice(weights):
    totals = []
    running_total = 0

    for w in weights:
        running_total += w
        totals.append(running_total)

    rnd = np.random.random() * running_total
    for i, total in enumerate(totals):
        if rnd < total:
            return i
    return len(totals) - 1 # guard against floating-point round-off in rnd
        
def sample(N):
    # N = word length
    x0 = '.' # x0 is space
    randomWord= x0 
    for i in range(N):
        #print("current letter: ", randomWord[i])
        p = np.asarray(T[letter2idx[randomWord[i]]]) # distribution of the next letter given the current letter
        p = p / np.sum(p) # normalize just in case
        randomWord+=(alphabet[weighted_choice(p)]) # get a letter with weighted probability and append it to the word 
        
    return(randomWord[1:]) # return the word omitting the initial space character
   
# main
print("Part 1")
for i in range(30):
    print("Sample of length ",i+1 ,": ",sample(i+1))

Part 1
Sample of length  1 :  s
Sample of length  2 :  co
Sample of length  3 :  ave
Sample of length  4 :  trds
Sample of length  5 :  d.wat
Sample of length  6 :  sid.sc
Sample of length  7 :  .aust.a
Sample of length  8 :  therrise
Sample of length  9 :  wlounvarr
Sample of length  10 :  toi.imbath
Sample of length  11 :  licowoach.h
Sample of length  12 :  any.ofronell
Sample of length  13 :  .f.nt.shtoto.
Sample of length  14 :  thesto.vararok
Sample of length  15 :  titondishethe.w
Sample of length  16 :  irs.hevithe.icej
Sample of length  17 :  pen.are.orethe.or
Sample of length  18 :  o.as.iathicren.the
Sample of length  19 :  .l.efaravere.lf.the
Sample of length  20 :  tlarareto.vounowouer
Sample of length  21 :  he.be.f.und.d.ben.cab
Sample of length  22 :  th.st.fuloch.d..fobeth
Sample of length  23 :  alo.e.heces.hious.t.iov
Sample of length  24 :  wid.d.incenor.s.ou.alath
Sample of length  25 :  weal.f.f.soundeanss.d.d.r
Sample of length  26 :  wo.cheref.d.foul.mmead.tym
S

## Part 2

For this part, we can consider 'chunks' of missing letters.
Define a chunk $C$ as $l_i x_1 x_2 x_3 \dots x_n l_f$ where $l_i$ is the initial letter and $l_f$ is the final letter of the chunk, both of which are observed. By the Markov(1) property, the observed letters $l_i$ and $l_f$ d-separate the missing letters in between from the rest of the string, so only these two letters affect them.

We again sample the letters in a chunk one at a time. This time, however, each conditional must also account for $l_f$: once we sample $x_i$, it takes over the role of $l_i$, and we repeat the calculation for the remaining, shorter gap until the chunk is filled.

To begin with, we sample $x_1$ from its conditional given both ends of the chunk (the factorization comes from the Markov(1) factor graph):

\begin{equation}
p(x_1|x_0=l_i,x_{n+1}=l_f) \propto p(x_1|x_0=l_i) \sum\limits_{x_{2:n}} p(x_2|x_1) \, p(x_3|x_2) \cdots p(x_{n+1}=l_f|x_n) = p(x_1|x_0=l_i) \, \big[ T^{n-1} \, T_{:,l_f} \big]_{x_1}
\end{equation}
where $T$ is the transition matrix and $T_{:,l_f}$ is its column for $l_f$, i.e. the vector with entries $p(x_{n+1}=l_f|x_n)$.

At this point, we have chosen $x_1$ to be $\hat{x}_1$.

Then we sample $x_2$ from 
\begin{equation}
p(x_2|x_1=\hat{x}_1,x_{n+1}=l_f) \propto p(x_2|x_1=\hat{x}_1) \sum\limits_{x_{3:n}} p(x_3|x_2) \, p(x_4|x_3) \cdots p(x_{n+1}=l_f|x_n) = p(x_2|x_1=\hat{x}_1) \, \big[ T^{n-2} \, T_{:,l_f} \big]_{x_2}
\end{equation}

$\dots$

Finally, 
\begin{equation}
p(x_n|x_{n-1}=\hat{x}_{n-1},x_{n+1}=l_f) \propto p(x_n|x_{n-1}=\hat{x}_{n-1}) \, p(x_{n+1}=l_f|x_n)
\end{equation}

Once we are done with chunk 1, we will go to the next one until there is no chunk left to complete.
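
The matrix-power form of the first conditional can be checked numerically. The sketch below compares $T^{n-1}$ applied to the column of $T$ for $l_f$ against a brute-force sum over every completion of the gap, using a made-up 3-state table. The names `toy_T`, `first_gap_conditional`, and `first_gap_conditional_brute` are hypothetical, and states are given as integer indices rather than letters.

```python
import numpy as np
from itertools import product

# Made-up 3-state transition table (rows sum to 1), standing in for the 27x27 table.
toy_T = np.array([[0.1, 0.6, 0.3],
                  [0.5, 0.1, 0.4],
                  [0.7, 0.3, 0.0]])

def first_gap_conditional(T, li, lf, n):
    """p(x_1 | x_0=li, x_{n+1}=lf) for a gap of n letters, via matrix powers."""
    backward = np.linalg.matrix_power(T, n - 1) @ T[:, lf]  # entries: p(x_{n+1}=lf | x_1)
    p = T[li] * backward                                    # times p(x_1 | x_0=li)
    return p / p.sum()

def first_gap_conditional_brute(T, li, lf, n):
    """The same quantity, obtained by summing the joint over every completion of the gap."""
    K = T.shape[0]
    p = np.zeros(K)
    for xs in product(range(K), repeat=n):
        path = (li,) + xs + (lf,)
        p[xs[0]] += np.prod([T[a, b] for a, b in zip(path[:-1], path[1:])])
    return p / p.sum()

print(first_gap_conditional(toy_T, li=2, lf=0, n=3))
print(first_gap_conditional_brute(toy_T, li=2, lf=0, n=3))
```

The two printed vectors should agree up to floating-point error. The brute-force version costs $K^n$ terms for an alphabet of size $K$, which is why the matrix form is the practical one for the 27-letter alphabet.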

In [23]:
import csv # transitions.csv holds a 27x27 table; stationary because the same table is used at every position
import numpy as np
import operator

from IPython.display import display, Latex

alphabet = [chr(i+ord('a')) for i in range(26)]
alphabet.append('.')
letter2idx = {c:i for i,c in enumerate(alphabet)}

T = [] # transition matrix
with open('transitions.csv') as csvfile:
    reader = csv.reader(csvfile, delimiter=',')
    for row in reader:
        row = [float(e) for e in row] # convert to float
        T.append(row) # add probabilities to T row by row

T = np.asarray(T) # convert T into array from list

test_strings = ['th__br__n.f_x.', '_u_st__n_.to_be._nsw_r__','i__at_._a_h_n_._e_r_i_g','q___t.___z._____t.__.___.__.']
        
def complete_word(incomplete_word):
    N = len(incomplete_word)
    a = []
    a_not = []
    for i in range(N):
        if incomplete_word[i]=='_' or incomplete_word[i]=='?':
            a_not.append(i)
        else: 
            a.append(i)    
    
    if len(a_not)==0:
        return incomplete_word
    
    missingchunks = [] 
    prev = a_not[0]-1
    missingchunk=[] # indicates the indices of a missing chunk; (!) missingchunk indices do not include l_i and l_f
    for i in a_not:
        if (i==prev+1):
            missingchunk.append(i)
        else: 
            missingchunks.append(missingchunk)
            missingchunk=[i]
        if(i==a_not[len(a_not)-1]):
                 missingchunks.append(missingchunk)
        prev=i
        
    #print(missingchunks)
    
    # Think of this as factorization. A chunk consists of li_______lf, 
    # where li is the given initial letter and lf is the given final letter of the chunk.
    # The letters in the next missing chunk do not depend on anything before its own li.
    
    # Markov(1) -> Given the present observation xi, the future is independent from the past(x{i-1},x{i-2}...)
    
    for mc in missingchunks: # for each missing chunk, fill in the blanks
        #print("In chunk: ", mc)
        li_index= mc[0]-1       
        lf_index= mc[len(mc)-1]+1
        
        # bounds check. If there is no li or lf, take them as '.'
        if li_index<0:
            li='.'
        else:
            li=incomplete_word[li_index]
            
        if lf_index>len(incomplete_word)-1:
            lf='.'
        else:
            lf=incomplete_word[lf_index]
            
        #print(li,lf)
        for i in range(len(mc)): # fill xi            
            p = T[letter2idx[li]] # p(xi | li): row of T for the letter just before xi
            tpower=1
            # Marginalization over the letters still missing between xi and lf
            for j in range (i,len(mc)-1):
                tpower=np.dot(tpower,(T)) # after the loop, tpower = T^(number of letters still missing after xi)
               
            # at this point, p has info coming from li and the following dot product has info coming from lf.
            # we combine these to yield final prob dist
            p = p * np.dot(tpower,T[:,letter2idx[lf]]) # final prob dist is proportional to p(xi|li) * p(lf|xi)
            p /= sum(p) # normalize
            
            xi = alphabet[weighted_choice(p)] # pick xi from final prob dist

            incomplete_word= incomplete_word[0:li_index+1+i]+xi+incomplete_word[li_index+i+2:]
            li=xi # xi is the new li
    
    return(incomplete_word)
        
        

# main
print("Part 2")
sample_size=10
for s in test_strings: # avoid shadowing the built-in str
    print("Original string: ", s)
    for i in range(sample_size):
        print("Sample ", i+1, ":", complete_word(s))
    print("\n")

title = "Pr?gr?mm?ng?H?m?w?rk.3"
print("Here's something extra. This is what my algorithm thinks the homework title should be:")
for i in range(sample_size):
    print("Sample ", i+1, ":", complete_word(title.lower()))
print("\n") 


Part 2
Original string:  th__br__n.f_x.
Sample  1 : the.brern.fex.
Sample  2 : the.brzan.fix.
Sample  3 : thy.brsen.fex.
Sample  4 : the.br.on.fox.
Sample  5 : thoobrtin.fix.
Sample  6 : the.brorn.fex.
Sample  7 : the.br.on.fex.
Sample  8 : the.brt.n.fex.
Sample  9 : the.brain.fox.
Sample  10 : the.br.an.fex.


Original string:  _u_st__n_.to_be._nsw_r__
Sample  1 : rursthind.torbe.rnswor.o
Sample  2 : qussth.nd.tombe.wnswarle
Sample  3 : pu.sthant.to.be.answer.r
Sample  4 : fuist.ind.toube.answarof
Sample  5 : outstheno.torbe.answ.r.t
Sample  6 : bursthind.to.be.onswerte
Sample  7 : cutstuind.to.be.onswarse
Sample  8 : suastland.toube.answerat
Sample  9 : ougstwano.to.be.wnswerod
Sample  10 : bunsthen..toube.onswer.p


Original string:  i__at_._a_h_n_._e_r_i_g
Sample  1 : ineaty.waghind.beer.ing
Sample  2 : ifrate.sathand.reirling
Sample  3 : imaati.sathind.befrsing
Sample  4 : is.ath.tatheno.rearying
Sample  5 : il.ate.bachin..merr.ing
Sample  6 : id.atl.wa.hwnd.tefreing
Sample  7 : i

## Part 3

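Everything is the same as in Part 2, except that after calculating the conditional distribution for $x_i$ we take the letter with the maximum probability instead of sampling from it. The sum of the logs of these maximal conditional probabilities is printed as well; by the chain rule, this sum is the conditional log-probability $\log p(\hat{x}_{-\alpha}|x_{\alpha})$ of the returned completion $\hat{x}_{-\alpha}$ (a gap at the end of the string is treated as if it were followed by '.'). Note that this is a greedy, letter-by-letter maximization, so it is not guaranteed to find the jointly most likely completion.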
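
Choosing the most probable letter position by position is a greedy choice. For comparison, here is a sketch of the max-product (Viterbi) recursion, which maximizes over a whole gap jointly and returns the joint log-probability of the gap together with its closing letter. It is not the method used in the cell below; `toy_T` and `viterbi_gap` are hypothetical names, the table values are made up, and states are integer indices.

```python
import numpy as np

# Made-up 3-state transition table (rows sum to 1), standing in for the 27x27 table.
toy_T = np.array([[0.1, 0.6, 0.3],
                  [0.5, 0.1, 0.4],
                  [0.7, 0.3, 0.0]])

def viterbi_gap(T, li, lf, n):
    """Jointly most likely x_1..x_n given x_0=li and x_{n+1}=lf.

    Returns (path, log p(x_1..x_n, x_{n+1}=lf | x_0=li)).
    """
    with np.errstate(divide='ignore'):
        logT = np.log(T)                  # log 0 = -inf marks impossible transitions
    delta = logT[li].copy()               # best log-score of a path ending in each value of x_1
    back = []
    for _ in range(n - 1):
        scores = delta[:, None] + logT    # scores[a, b]: best path ending in a, followed by a -> b
        back.append(scores.argmax(axis=0))
        delta = scores.max(axis=0)
    delta = delta + logT[:, lf]           # close the gap with the observed letter lf
    last = int(delta.argmax())
    path = [last]
    for bp in reversed(back):             # follow the back-pointers from x_n down to x_1
        path.append(int(bp[path[-1]]))
    return path[::-1], float(delta[last])

print(viterbi_gap(toy_T, li=2, lf=0, n=2))
```

The recursion costs on the order of $nK^2$ operations for a gap of length $n$ over $K$ symbols, so applying it to each chunk of the test strings with the 27-letter table would be cheap.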

In [20]:
import csv # transitions.csv holds a 27x27 table; stationary because the same table is used at every position
import numpy as np
import operator

from IPython.display import display, Latex

alphabet = [chr(i+ord('a')) for i in range(26)]
alphabet.append('.')
letter2idx = {c:i for i,c in enumerate(alphabet)}

T = [] # transition matrix
with open('transitions.csv') as csvfile:
    reader = csv.reader(csvfile, delimiter=',')
    for row in reader:
        row = [float(e) for e in row] # convert to float
        T.append(row) # add probabilities to T row by row

T = np.asarray(T)  # convert T into array from list

test_strings = ['th__br__n.f_x.', '_u_st__n_.to_be._nsw_r__','i__at_._a_h_n_._e_r_i_g','q___t.___z._____t.__.___.__.']
        
def complete_word(incomplete_word):
    logprob=0
    N = len(incomplete_word)
    a = []
    a_not = []
    for i in range(N):
        if incomplete_word[i]=='_' or incomplete_word[i]=='?':
            a_not.append(i)
        else: 
            a.append(i)    
    
    if len(a_not)==0:
        return (logprob,incomplete_word)
    
    missingchunks = [] 
    prev = a_not[0]-1
    missingchunk=[] # indicates the indices of a missing chunk; (!) missingchunk indices do not include l_i and l_f
    for i in a_not:
        if (i==prev+1):
            missingchunk.append(i)
        else: 
            missingchunks.append(missingchunk)
            missingchunk=[i]
        if(i==a_not[len(a_not)-1]):
                 missingchunks.append(missingchunk)
        prev=i
        
    #print(missingchunks)
    
    # Think of this as factorization. A chunk consists of li_______lf, 
    # where li is the given initial letter and lf is the given final letter of the chunk.
    # The letters in the next missing chunk do not depend on anything before its own li.
    
    # Markov(1) -> Given the present observation xi, the future is independent from the past(x{i-1},x{i-2}...)
    
    for mc in missingchunks: # for each missing chunk, fill in the blanks
        #print("In chunk: ", mc)
        li_index= mc[0]-1       
        lf_index= mc[len(mc)-1]+1
        
        # bounds check. If there is no li or lf, take them as '.'
        if li_index<0:
            li='.'
        else:
            li=incomplete_word[li_index]
            
        if lf_index>len(incomplete_word)-1:
            lf='.'
        else:
            lf=incomplete_word[lf_index]
            
        #print(li,lf)
        for i in range(len(mc)): # fill xi            
            p = T[letter2idx[li]] # p(xi | li): row of T for the letter just before xi
            tpower=1
            # Marginalization over the letters still missing between xi and lf
            for j in range (i,len(mc)-1):
                tpower=np.dot(tpower,(T)) # after the loop, tpower = T^(number of letters still missing after xi)
               
            # at this point, p has info coming from li and the following dot product has info coming from lf.
            # we combine these to yield final prob dist
            p = p * np.dot(tpower,T[:,letter2idx[lf]]) # final prob dist is proportional to p(xi|li) * p(lf|xi)
            p /= sum(p) # normalize
            
            xi = alphabet[np.argmax(p)] # pick xi to be the letter with maximum probability
            logprob+=np.log(p[np.argmax(p)]) # add log of this prob to the log cumulative sum

            incomplete_word= incomplete_word[0:li_index+1+i]+xi+incomplete_word[li_index+i+2:]
            li=xi # xi is the new li
    
    return(logprob,incomplete_word)
        
# main
print("Part 3")
sample_size=10
for s in test_strings: # avoid shadowing the built-in str
    print("Original string: ", s)
    print("(log-probability, guess)=",complete_word(s),"\n")

title = "Pr?gr?mm?ng?H?m?w?rk.3"
print("Here's something extra. This is what my algorithm thinks the homework title should be:")
print("(log-probability, guess)=",complete_word(title.lower()), "\n" )

Part 3
Original string:  th__br__n.f_x.
(log-probability, guess)= (-3.0743348657731824, 'the.br.an.fex.') 

Original string:  _u_st__n_.to_be._nsw_r__
(log-probability, guess)= (-11.069327972319437, 'oursthend.to.be.answered') 

Original string:  i__at_._a_h_n_._e_r_i_g
(log-probability, guess)= (-11.636089996019512, 'in.ath.wathend.he.r.ing') 

Original string:  q___t.___z._____t.__.___.__.
(log-probability, guess)= (-22.923642422825658, 'qur.t.thiz.the.at.an.the.an.') 

Here's something extra. This is what my algorithm thinks the homework title should be:
(log-probability, guess)= (-6.626367363404877, 'pr.gr.mmang.hem.werk.3') 



## Part 4
The question could also be solved using the sum-product algorithm, which applies to far more general graphs than Markov(1) chains. To improve the estimates themselves, a higher-order Markov model would likely give better results in predicting the missing letters: in real text, a given letter directly influences the next two, three, or even more letters rather than just one.
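
As one illustration of a higher-order model, the sketch below estimates a Markov(2) table from trigram counts. It is a sketch under stated assumptions: the function name `fit_markov2`, the additive smoothing constant, and the tiny stand-in corpus are all hypothetical, and a real estimate would need a large body of English text.

```python
import numpy as np

alphabet = [chr(i + ord('a')) for i in range(26)] + ['.']
letter2idx = {c: i for i, c in enumerate(alphabet)}

def fit_markov2(text, smoothing=1.0):
    """Estimate T2[i, j, k] = p(x_t = k | x_{t-2} = i, x_{t-1} = j) from trigram counts."""
    K = len(alphabet)
    counts = np.full((K, K, K), smoothing)   # additive smoothing keeps unseen trigrams possible
    idx = [letter2idx[c] for c in text]
    for i, j, k in zip(idx, idx[1:], idx[2:]):
        counts[i, j, k] += 1
    return counts / counts.sum(axis=2, keepdims=True)  # normalize over the next letter k

# Stand-in corpus, already mapped to the 27-symbol alphabet; far too small in practice.
corpus = "the.quick.brown.fox.jumps.over.the.lazy.dog."
T2 = fit_markov2(corpus)
print(T2[letter2idx['t'], letter2idx['h'], letter2idx['e']])
```

The table grows from $27^2$ to $27^3$ entries, so the data requirement grows with the order of the model; that trade-off is the main cost of this kind of improvement.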