Name: M. Kutay Yabas

I hereby declare that I observed the honour code of the university when preparing the homework.

## Programming Homework 3

In this exercise we model a string of text using a Markov(1) model. For simplicity we only consider the letters 'a-z'. Capital letters 'A-Z' are mapped to the corresponding lowercase letters. All remaining characters (symbols, digits, and spaces) are denoted by '.'.
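
The mapping described above can be sketched as follows. The helper name `normalize` is an assumption made for illustration and is not part of the homework code.

```python
import string

LETTERS = set(string.ascii_lowercase)

def normalize(text):
    """Map raw text onto the 27-symbol alphabet 'a'-'z' plus '.'."""
    out = []
    for ch in text:
        lower = ch.lower()
        # Keep ASCII letters (after lowercasing); everything else becomes '.'
        out.append(lower if lower in LETTERS else '.')
    return ''.join(out)

print(normalize("The Quick, brown fox!"))  # the.quick..brown.fox.
```

Note that every non-letter character becomes its own '.', so a comma followed by a space yields two consecutive dots.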


We have a probability table $T$, where $T_{i,j} = p(x_t = j | x_{t-1} = i)$ is the transition model of letters in English text for $t=1,2,\dots,N$. Assume that the initial letter in a string is always a space, denoted by $x_0 = \text{'.'}$. Such a model, where the probability table is the same at every step, is sometimes called a stationary model.
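
To make the convention $T_{i,j} = p(x_t = j | x_{t-1} = i)$ concrete, here is a minimal sketch on a toy 3-symbol alphabet. The values in `toy_T` are made up for illustration and are not taken from `transitions.csv`; the helper `log_prob` is likewise hypothetical.

```python
import numpy as np

# Toy alphabet and transition table (illustrative values only).
# Row i holds p(x_t = j | x_{t-1} = i).
toy_alphabet = ['a', 'b', '.']
toy_idx = {c: i for i, c in enumerate(toy_alphabet)}
toy_T = np.array([[0.1, 0.6, 0.3],
                  [0.5, 0.2, 0.3],
                  [0.7, 0.3, 0.0]])

# Each row is a distribution over the next symbol, so it must sum to 1.
assert np.allclose(toy_T.sum(axis=1), 1.0)

def log_prob(s, T, idx, x0='.'):
    """Return log p(x_1..x_N | x_0) for the string s under a stationary chain."""
    total = 0.0
    prev = x0
    for ch in s:
        total += np.log(T[idx[prev], idx[ch]])
        prev = ch
    return total

# Probability of 'ab.' given x_0 = '.': equals 0.7 * 0.6 * 0.3
print(np.exp(log_prob('ab.', toy_T, toy_idx)))
```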

1. For a given $N$, write a program to sample random strings with letters $x_1, x_2, \dots, x_N$ from $p(x_{1:N}|x_0)$
1. Now suppose you are given strings with missing letters, where each missing letter is denoted by a question mark (or underscore, as below). Implement a method that samples missing letters conditioned on observed ones, i.e., samples from $p(x_{-\alpha}|x_{\alpha})$ where $\alpha$ denotes indices of observed letters. For example, if the input is 't??.', we have $N=4$ and
$x_1 = \text{'t'}$ and $x_4 = \text{'.'}$, $\alpha=\{1,4\}$ and $-\alpha=\{2,3\}$. Your program may possibly generate the strings 'the.', 'twi.', 'tee.', etc. Hint: make sure to make use of all the data given and sample from the correct distribution. Implement the method and print the results for the test strings below.
1. Describe a method for filling in the gaps by estimating the most likely letter for each position. Hint: you need to compute
$$
x_{-\alpha}^* = \arg\max_{x_{-\alpha}} p(x_{-\alpha}|x_{\alpha})
$$
Implement the method and print the results for the following test strings along with the log-probability  $\log p(x_{-\alpha}^*,x_{\alpha})$.
1. Discuss how you can improve the model to get better estimations.

In [1]:
test_strings = ['th__br__n.f_x.', '_u_st__n_.to_be._nsw_r__','i__at_._a_h_n_._e_r_i_g','q___t.___z._____t.__.___.__.']

Hint: The code below loads a table of transition probabilities for English text.

In [2]:
import csv
import pandas as pd
import numpy as np
from IPython.display import display, Latex

alphabet = [chr(i+ord('a')) for i in range(26)]
alphabet.append('.')
letter2idx = {c:i for i,c in enumerate(alphabet)}

T = []
with open('transitions.csv') as csvfile:
    reader = csv.reader(csvfile, delimiter=',')
    for row in reader:
        T.append(row)

print('Example')
## p(x_t = 'u' | x_{t-1} = 'q')
display(Latex(r"$p(x_t = \text{'u'} | x_{t-1} = \text{'q'})$"))
print(T[letter2idx['q']][letter2idx['u']])
display(Latex(r"$p(x_t | x_{t-1} = \text{'a'})$"))
# for c,p in zip(alphabet,T[letter2idx['a']]):
#     print(c,p)

Example


$p(x_t = \text{'u'} | x_{t-1} = \text{'q'})$

0.9949749


$p(x_t | x_{t-1} = \text{'a'})$

In [3]:
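# Some tools that are going to be useful
M = pd.DataFrame(T).astype("float")
# Note: M[k] selects column k, i.e. p(x_t = k | x_{t-1} = i) for every i;
# M.iloc[k] selects row k, the distribution p(x_t | x_{t-1} = k).
M[letter2idx['a']].max()
M[letter2idx['a']].argmax()
idx2letter = {i:c for i,c in enumerate(alphabet)}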

### 1

By the Markov property, the joint distribution factorizes as

$p(x_{1:N}|x_0) = p(x_{1}|x_{0} = x_{0}')\,p(x_{2}|x_{1}) \dots p(x_{N}|x_{N-1})$

The only factor that involves the given initial state is $p(x_{1}|x_{0} = x_{0}')$, where $x_{0}' = \text{'.'}$.
Starting from '.', we repeatedly draw the next letter from the row of the transition matrix that corresponds to the current letter.

In [4]:
import random

def guess_xk(x0):
    # Inverse-CDF sampling of the next letter given the previous letter x0.
    # numpy.random.choice seemed to mishandle the weights; after trying for
    # some time I found this algorithm at
    # https://www.safaribooksonline.com/library/view/python-cookbook-2nd/0596007973/ch04s22.html
    # I later realized the problem may be related to Python 2/3 differences.
    x = random.uniform(0, 1)
    cumulative_probability = 0.0
    for item, item_probability in zip(alphabet,  T[letter2idx[x0]]):
        cumulative_probability += float(item_probability)
        if x < cumulative_probability: break
    return item

def generate(N=10,x0="."):
    string = x0 # x0
    for i in range(N):
        string = string + guess_xk(string[-1])
    return string

print( generate(1000) )

.amis.t.ayofooubusis.achby.tun.mpe.l.s.th.e.ghe.ndeched.o.f.be.ces.th.ee.d.ited.th.gspitsevitilersog.th.aig.w.apan.se.tis.blad.litantyos.wheshe.estheave.thedsthers.ing.i.weetansim.pe.o.hrereveisaming.tassancas.wherithitrontom.eng.ny.he.t.d..rgrdincal.g.bughier.th.oudsped.th.hinth.wee.whar.medle.acowarmeder.me.e.areat.han.ceas.f.oupong.dars..oet.an.foninse.t.tma.stry.tsatheded.thed.msindit.bomod.s..f.of.arend.theron.cithe.ous.d.internud.wa.asth.tatung.astomme.manerseractherwimond.atimannche.oous.fateatime.s.hen.ce.ldur.the.remorof.ct.coveyofoulls..ans.athaue.stortling.lde.iesou.be.es.julag.igllpot.outherendivo.tscch.thatsasdwar.mes.hegindes..tigucarinomppe.tharino.savoly.tito.ilenof.isousthe.wepomeas.is.n.pes.pendevneapex.f.bo.in..s.ant.hucrnchimeanseesos.we.w.mutol..ghth.by.cti.r.ethothembongendemeercedima.powasin.owoousthy.wale.hy.a.blde.clly.at.iay.ts.rdld.d.therre.bexcthoume..lyon.ce.os.tir.d.husut.o.thashe.toff.ergnd.oped.t.dangrmoostr.cul.hake.aby.d.th.s.h.was..dint.ndof.br.tceie.
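
As an alternative to the hand-written inverse-CDF loop, the same ancestral sampling can be sketched with NumPy's `Generator.choice`. This is a sketch on a toy table, not the homework's `transitions.csv`; the names `toy_T`, `toy_alphabet` and `sample_chain` are assumptions. `choice` raises an error when `p` does not sum to 1 within tolerance, which rounded table entries may trigger, so each row is renormalized first.

```python
import numpy as np

toy_alphabet = ['a', 'b', '.']
toy_T = np.array([[0.1, 0.6, 0.3],   # row i: p(x_t = j | x_{t-1} = i)
                  [0.5, 0.2, 0.3],
                  [0.7, 0.3, 0.0]])

def sample_chain(n, T, alphabet, x0='.', seed=None):
    """Draw x_1..x_n from p(x_{1:n} | x_0) one letter at a time."""
    rng = np.random.default_rng(seed)
    state = alphabet.index(x0)
    out = []
    for _ in range(n):
        p = T[state] / T[state].sum()  # renormalize against rounding error
        state = rng.choice(len(alphabet), p=p)
        out.append(alphabet[state])
    return ''.join(out)

s = sample_chain(20, toy_T, toy_alphabet, seed=0)
print(s)
```

In the toy table the transition from '.' to '.' has probability zero, so a sampled string never contains two consecutive dots.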

### 2

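In this part, consider a single gap: the letters $x_0 = x_0'$ and $x_N = x_N'$ are observed and $x_{1:N-1}$ are missing. By the Markov property,

$p(x_{1:N-1}|x_{0}=x_{0}',x_{N}=x_{N}') \propto p(x_{N}=x_{N}'|x_{N-1})\,p(x_{N-1}|x_{N-2}) \dots p(x_{1}|x_{0}=x_{0}')$

Summing out $x_2, \dots, x_{N-1}$ gives the conditional of the first missing letter:

$p(x_{1}|x_{0}=x_{0}',x_{N}=x_{N}') \propto \sum_{x_{2:N-1}} p(x_{N}=x_{N}'|x_{N-1}) \dots p(x_{2}|x_{1})\,p(x_{1}|x_{0}=x_{0}')$

With $T$ the transition matrix and $p(x_N = x_N'|x_{N-1})$ read as a column vector indexed by $x_{N-1}$, this is

$p(x_{1}|x_{0}=x_{0}',x_{N}=x_{N}') \propto p(x_{1}|x_{0}=x_{0}')\,\big(T^{N-2}\,p(x_{N}=x_{N}'|x_{N-1})\big)_{x_1} = T_{x_0',x_1}\,(T^{N-1})_{x_1,x_N'}$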
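
The matrix-power expression for the first missing letter, $T_{x_0',x_1}(T^{N-1})_{x_1,x_N'}$ up to normalization, can be sanity-checked against brute-force enumeration on a small chain. The toy table and both function names below are assumptions made for this sketch.

```python
import itertools
import numpy as np

toy_T = np.array([[0.1, 0.6, 0.3],   # row i: p(x_t = j | x_{t-1} = i)
                  [0.5, 0.2, 0.3],
                  [0.7, 0.3, 0.0]])

def first_gap_conditional(T, x0, xN, N):
    """p(x_1 | x_0, x_N) via T[x0, x1] * (T^(N-1))[x1, xN], normalized."""
    tail = np.linalg.matrix_power(T, N - 1)[:, xN]
    p = T[x0] * tail
    return p / p.sum()

def brute_force(T, x0, xN, N):
    """Same quantity by enumerating every filling of x_1..x_{N-1}."""
    K = T.shape[0]
    p = np.zeros(K)
    for mid in itertools.product(range(K), repeat=N - 1):
        path = (x0,) + mid + (xN,)
        weight = np.prod([T[a, b] for a, b in zip(path, path[1:])])
        p[mid[0]] += weight  # accumulate by the value of x_1
    return p / p.sum()

closed = first_gap_conditional(toy_T, x0=2, xN=0, N=4)
brute = brute_force(toy_T, x0=2, xN=0, N=4)
print(closed, brute)
```

Sampling a whole gap then amounts to drawing $x_1$ from this distribution, treating the drawn letter as the new $x_0$, and repeating with one fewer remaining transition.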







In [5]:
def guess(b, m, e):  # previous char, transitions left until e, end char
    if b == "":
        b = '.'  # strings are assumed to start right after a space
    Tm = M.values  # row i holds p(x_t | x_{t-1} = i)
    # p(x | b, e) is proportional to T[b, x] * (T^m)[x, e]
    mpow = np.linalg.matrix_power(Tm, m)
    prob = Tm[letter2idx[b]] * mpow[:, letter2idx[e]]
    prob = prob / prob.sum()

    x = random.uniform(0, 1)
    cumulative_probability = 0.0
    for item, item_probability in zip(alphabet, prob):
        cumulative_probability += float(item_probability)
        if x < cumulative_probability: break
    return item

In [6]:
predict = [list(string) for string in test_strings]
for string in predict:
    i = 0
    m = 0   # number of missing letters in the current gap
    b = ""  # last known (observed or sampled) char before the gap
    while i < len(string):
        if string[i] == "_":
            m += 1
        else:
            e = string[i]  # observed char that closes the gap (if any)
            # Fill the gap left to right; the j-th missing letter is
            # m - j transitions away from e.
            for j in range(m):
                g = guess(b, m - j, e)
                b = g
                string[i - m + j] = g
            m = 0
            b = e
        i += 1
    # A trailing gap has no closing char: sample forward from the model.
    for j in range(m):
        g = guess_xk(b if b != "" else '.')
        b = g
        string[len(string) - m + j] = g
    print("".join(string))

th.abrain.fex.
ouestaeno.toabe.inswurle
i.tats.lathany.reardiig
qcnst.uizz..ei.nt.en.ees.an.


### 3

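As in part 2, we fill each gap one letter at a time, but take the argmax of each conditional distribution instead of sampling from it. Note that this is a greedy, letter-by-letter approximation: it does not in general return the joint maximizer $x_{-\alpha}^*$, which requires a Viterbi-style dynamic program over the whole string. The reported `logprob` is the sum of the log-probabilities of the individual greedy choices.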

In [7]:
def guess_ml(b,m,e): # beginning char, missing length, end char
    if b == "":
        b = '.'
    mpow = np.linalg.matrix_power(M,m)
    # Note: M[k] selects column k of the DataFrame
    prob = M[letter2idx[b]].dot(mpow)*(M[letter2idx[e]])
    prob = prob / prob.sum(axis=0)
    # Highest-scoring letter and the log of its normalized score
    return idx2letter[prob.argmax()], np.log(prob.max())

def guess_xk_ml(b):
    return idx2letter[ M[letter2idx[b]].argmax()], np.log( M[letter2idx[b]].max())


predict = [list(string) for string in test_strings]
for string in predict:
    sum_log_prob = 0
    print("".join(string))
    i = 0
    m = 0 # missing count
    b = ""
    e = ""
    while i < len(string):
        if string[i] == "_":
            m+=1
        else:
            if m == 0:
                b = string[i] # beginning char
            if m > 0:
                e = string[i] # ending char
                # Fill the gap right to left; each guess becomes the
                # right-hand neighbour e for the next guess
                for j in range(m):
                    g, log_prob = guess_ml(b,m,e)
                    sum_log_prob += log_prob
                    e = g
                    string[i-(j+1)] = g
                    print("".join(string))
                m = 0
        i+=1
        if i == len(string) and m > 0:
            # Trailing gap: fill left to right from the last char we have
            for j in range(m):
                g, log_prob = guess_xk_ml(e)
                sum_log_prob += log_prob
                e = g
                string[i-(m-j)] = g
                print("".join(string))
    print("logprob",sum_log_prob)

th__br__n.f_x.
th_.br__n.f_x.
the.br__n.f_x.
the.br_in.f_x.
the.br.in.f_x.
the.br.in.fex.
logprob -5.09978757921
_u_st__n_.to_be._nsw_r__
ou_st__n_.to_be._nsw_r__
ou.st__n_.to_be._nsw_r__
ou.st_in_.to_be._nsw_r__
ou.st.in_.to_be._nsw_r__
ou.st.int.to_be._nsw_r__
ou.st.int.to.be._nsw_r__
ou.st.int.to.be.insw_r__
ou.st.int.to.be.inswur__
ou.st.int.to.be.inswurq_
ou.st.int.to.be.inswurqx
logprob -15.7117569057
i__at_._a_h_n_._e_r_i_g
i_.at_._a_h_n_._e_r_i_g
ie.at_._a_h_n_._e_r_i_g
ie.att._a_h_n_._e_r_i_g
ie.att..a_h_n_._e_r_i_g
ie.att..ath_n_._e_r_i_g
ie.att..athin_._e_r_i_g
ie.att..athint._e_r_i_g
ie.att..athint.he_r_i_g
ie.att..athint.heer_i_g
ie.att..athint.heer.i_g
ie.att..athint.heer.ing
logprob -16.008142732
q___t.___z._____t.__.___.__.
q__.t.___z._____t.__.___.__.
q_e.t.___z._____t.__.___.__.
qhe.t.___z._____t.__.___.__.
qhe.t.__rz._____t.__.___.__.
qhe.t._erz._____t.__.___.__.
qhe.t.herz._____t.__.___.__.
qhe.t.herz.____.t.__.___.__.
qhe.t.herz.___e.t.__.___.__.
qhe.t.herz.__he.t.
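
For comparison, the joint maximizer $x_{-\alpha}^*$ from the hint can be computed exactly with a Viterbi-style dynamic program in which observed positions are clamped to their letters. The following is a minimal sketch on a toy table, not the code that produced the output above; `viterbi_fill`, `toy_T` and `toy_alphabet` are hypothetical names. It returns the filled string together with $\log p(x_{-\alpha}^*, x_{\alpha} | x_0)$.

```python
import numpy as np

toy_alphabet = ['a', 'b', '.']
toy_T = np.array([[0.1, 0.6, 0.3],   # row i: p(x_t = j | x_{t-1} = i)
                  [0.5, 0.2, 0.3],
                  [0.7, 0.3, 0.0]])

def viterbi_fill(pattern, T, alphabet, x0='.', gap='_'):
    """Return (best string, log p(best string | x0)) for a pattern with gaps."""
    idx = {c: i for i, c in enumerate(alphabet)}
    K = len(alphabet)
    with np.errstate(divide='ignore'):
        logT = np.log(T)  # log(0) = -inf marks impossible transitions

    def allowed(ch):
        # 0 for states consistent with the observation, -inf otherwise
        mask = np.full(K, -np.inf)
        if ch == gap:
            mask[:] = 0.0
        else:
            mask[idx[ch]] = 0.0
        return mask

    # delta[k]: best log-probability of any prefix that ends in state k
    delta = logT[idx[x0]] + allowed(pattern[0])
    back = []
    for ch in pattern[1:]:
        scores = delta[:, None] + logT  # scores[i, j]: come from i, go to j
        back.append(scores.argmax(axis=0))
        delta = scores.max(axis=0) + allowed(ch)

    # Trace the best path backwards from the best final state
    state = int(delta.argmax())
    best_logp = float(delta[state])
    path = [state]
    for ptr in reversed(back):
        state = int(ptr[state])
        path.append(state)
    return ''.join(alphabet[k] for k in reversed(path)), best_logp

print(viterbi_fill('a_.', toy_T, toy_alphabet))
```

Unlike a greedy fill, this considers all letters of a gap jointly, so a locally unlikely letter can be chosen when it leads to a more probable string overall.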

### 4
A Markov(1) model is not enough to capture the phonetic patterns of the English language. A higher-order model, which conditions on more than one previous letter, would work better. Because capital letters are mapped to lowercase letters, information about transitions at the beginning of sentences is lost. The distribution of word lengths might also be useful: for example, the only valid single-letter words are 'a' and 'I'. Common double letters, suffixes, and prefixes might also be beneficial.
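
One way to move beyond a Markov(1) model is to condition on the two previous letters. The sketch below estimates such a Markov(2) table from trigram counts with additive smoothing. The training string, the smoothing constant and the function name `fit_markov2` are all assumptions for illustration; a real estimate would need a large normalized corpus.

```python
from collections import Counter, defaultdict

def fit_markov2(text, alphabet, smoothing=1.0):
    """Estimate p(x_t | x_{t-2}, x_{t-1}) from trigram counts (add-k smoothing)."""
    counts = defaultdict(Counter)
    for a, b, c in zip(text, text[1:], text[2:]):
        counts[a + b][c] += 1

    def prob(context, ch):
        # context holds the two preceding symbols, e.g. 'th'
        total = sum(counts[context].values()) + smoothing * len(alphabet)
        return (counts[context][ch] + smoothing) / total

    return prob

alphabet = list('abcdefghijklmnopqrstuvwxyz.')
p = fit_markov2('the.then.the.thin.', alphabet)
print(p('th', 'e'), p('th', 'x'))
```

The table grows from $27^2$ to $27^3$ entries, so smoothing (or backing off to the Markov(1) table) matters for contexts that rarely occur in the training text.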