## Pr?gr?mm?ng?H?m?w?rk 3

In this exercise we model a string of text using a Markov(1) model. For simplicity we only consider the letters 'a-z'. Capital letters 'A-Z' are mapped to the corresponding lowercase letters. All remaining characters (symbols, numbers, and spaces) are denoted by '.'.


We have a probability table $T$, where $T_{i,j} = p(x_t = j | x_{t-1} = i)$ is the transition model of letters in English text for $t=1,2 \dots N$. Assume that the initial letter in a string is always a space, denoted as $x_0 = \text{'.'}$. Such a model, where the probability table is the same at every time step, is sometimes called a stationary model.
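
Written out, the stationary Markov(1) assumption means that the probability of a whole string factorizes into a product of entries of the same table $T$:
$$
p(x_{1:N} | x_0) = \prod_{t=1}^{N} p(x_t | x_{t-1}) = \prod_{t=1}^{N} T_{x_{t-1}, x_t}
$$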

1. For a given $N$, write a program to sample random strings with letters $x_1, x_2, \dots, x_N$ from $p(x_{1:N}|x_0)$
1. Now suppose you are given strings with missing letters, where each missing letter is denoted by a question mark (or underscore, as below). Implement a method that samples the missing letters conditioned on the observed ones, i.e., samples from $p(x_{-\alpha}|x_{\alpha})$, where $\alpha$ denotes the indices of the observed letters. For example, if the input is 't??.', we have $N=4$,
$x_1 = \text{'t'}$ and $x_4 = \text{'.'}$, $\alpha=\{1,4\}$ and $-\alpha=\{2,3\}$. Your program may generate strings such as 'the.', 'twi.', 'tee.', etc. Hint: make sure to use all the given data and to sample from the correct distribution. Implement the method and print the results for the test strings below.
1. Describe a method for filling in the gaps by estimating the most likely letter for each position. Hint: you need to compute
$$
x_{-\alpha}^* = \arg\max_{x_{-\alpha}} p(x_{-\alpha}|x_{\alpha})
$$
Implement the method and print the results for the following test strings, along with the log-probability $\log p(x_{-\alpha}^*,x_{\alpha})$.
1. Discuss how you can improve the model to get better estimations.
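
For tasks 2 and 3 it may help to note that conditioning on the observed letters only fixes some of the factors in the Markov(1) product. As a sketch (with $x_0$ kept implicit, as in the task statement, and the observed positions held fixed):
$$
p(x_{-\alpha} | x_{\alpha}) \propto p(x_{1:N} | x_0) = \prod_{t=1}^{N} T_{x_{t-1}, x_t}
$$
In particular, a run of missing letters $x_{a+1:b-1}$ between two observed letters $x_a$ and $x_b$ depends only on those two endpoints:
$$
p(x_{a+1:b-1} | x_a, x_b) = \frac{\prod_{t=a+1}^{b} T_{x_{t-1}, x_t}}{(T^{b-a})_{x_a, x_b}}
$$
where $(T^{b-a})_{x_a, x_b} = p(x_b | x_a)$ is an entry of the $(b-a)$-th matrix power of $T$.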

In [1]:
test_strings = ['th__br__n.f_x.', '_u_st__n_.to_be._nsw_r__','i__at_._a_h_n_._e_r_i_g','q___t.___z._____t.__.___.__.']

Hint: The code below loads a table of transition probabilities for English text.

In [2]:
import csv
from IPython.display import display, Latex

alphabet = [chr(i+ord('a')) for i in range(26)]
alphabet.append('.')
letter2idx = {c:i for i,c in enumerate(alphabet)}

T = []
with open('transitions.csv') as csvfile:
    reader = csv.reader(csvfile, delimiter=',')
    for row in reader:
        T.append(row)

print('Example')
## p(x_t = 'u' | x_{t-1} = 'q')
display(Latex(r"$p(x_t = \text{'u'} | x_{t-1} = \text{'q'})$"))
print(T[letter2idx['q']][letter2idx['u']])
display(Latex(r"$p(x_t | x_{t-1} = \text{'a'})$"))
for c,p in zip(alphabet,T[letter2idx['a']]):
    print(c,p)

Example


$p(x_t = \text{'u'} | x_{t-1} = \text{'q'})$

0.9949749

$p(x_t | x_{t-1} = \text{'a'})$

a 0.0002835
b 0.0228302
c 0.0369041
d 0.0426290
e 0.0012216
f 0.0075739
g 0.0171385
h 0.0014659
i 0.0372661
j 0.0002353
k 0.0110124
l 0.0778259
m 0.0260757
n 0.2145354
o 0.0005459
p 0.0195213
q 0.0001749
r 0.1104770
s 0.0934290
t 0.1317960
u 0.0098029
v 0.0306574
w 0.0088799
x 0.0009562
y 0.0233701
z 0.0018701
. 0.0715219
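
The printed values are rounded to seven decimals, so a row need not sum to exactly one (the row for 'a' above adds up to about 1.0000001). A quick check along the following lines can make this visible before sampling. This is a minimal sketch rather than part of the original notebook: `normalize_rows` is a hypothetical helper, and it assumes the table is a rectangular list of numeric strings like the `T` read from the CSV file.

```python
import numpy as np

def normalize_rows(rows):
    """Convert a table of numeric strings to floats and rescale each row to sum to 1.

    Returns the normalized matrix and the largest deviation of an original
    row sum from 1, which indicates how much rounding the table contained.
    """
    P = np.array(rows, dtype=float)            # strings such as '0.25' become floats
    row_sums = P.sum(axis=1, keepdims=True)    # one sum per conditioning letter
    max_dev = float(np.abs(row_sums - 1.0).max())
    return P / row_sums, max_dev
```

Normalizing once up front is one option; renormalizing each row right before sampling, as later cells do, has the same effect.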


## Preprocessing

The variables created here are used later in the program.

In [3]:
import numpy as np
import string

alphabet = list(string.ascii_lowercase)
alphabet.append('.')
transition = {}
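# transition[prev_letter][next_letter] = p(x_t = next_letter | x_{t-1} = prev_letter)
for indx1, prev_letter in enumerate(alphabet):
    next_probs = {}
    for indx2, next_letter in enumerate(alphabet):
        next_probs[next_letter] = float(T[indx1][indx2])
    transition[prev_letter] = next_probs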
# transition['q']['u'] = 0.9949749 -> p(x_t = 'u' | x_{t-1} = 'q')
Tt = np.array([np.array(list(prob.values())) for prob in list(transition.values())])
# Tt[16] -> p(x_t| x_{t-1} = 'q')

## Forward Generator

Implements the generative model in the forward direction only: each letter is sampled from the conditional distribution given the previous letter.

In [4]:
def generate_string(transition, N):
    alphabet = list(transition.keys())
    # initial character is always '.'
    string = "."
    for i in range(1, N + 1):
        letter_probs = np.array(list(transition[string[i-1]].values()))
        letter_probs_normalized = np.true_divide(letter_probs, np.sum(letter_probs))
        string += np.random.choice(alphabet, p=letter_probs_normalized)[0]
    return string[1:]
print(generate_string(transition, np.random.randint(1, 50)))

ins.atichooponthe.thisorind.se.na
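
Task 3 asks for a log-probability, and scoring a complete string under the chain is a small building block for that. The following is a minimal sketch, not part of the original notebook: `string_log_prob` is a hypothetical helper, and it assumes the table is available as a float array whose rows are indexed by the previous letter (like `Tt` above) and that the string starts after $x_0 = \text{'.'}$.

```python
import numpy as np

def string_log_prob(text, T, alphabet, start='.'):
    """Return log p(x_1..x_N | x_0 = start) for a fully observed string."""
    idx = {c: i for i, c in enumerate(alphabet)}
    prev = idx[start]
    total = 0.0
    with np.errstate(divide='ignore'):       # log(0) = -inf for impossible transitions
        for ch in text:
            cur = idx[ch]
            total += np.log(T[prev, cur])    # add log p(x_t = ch | x_{t-1})
            prev = cur
    return float(total)
```

Summing logarithms instead of multiplying probabilities avoids numerical underflow on long strings.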


## Forward-Backward Generator

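Uses the observed character after each gap, in addition to the one before it, by implementing a forward-backward style algorithm, which is expected to produce better completions. The implementation is most likely wrong (for instance, the forward step uses the element-wise `np.multiply` rather than a matrix-vector product), so it does not seem to give better results. It also does not use log-probabilities in the calculations.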

In [5]:
import re

def f_b(transition, sub_str):
    '''
    Forward-Backward algorithm for Markov-1
    '''
    alphabet = list(transition.keys())
    # propagate probability using initial condition(known letter)
    # prior is 1 and not considered since first letter is known
    f = [Tt[letter2idx[sub_str[0]]]]
    for i in range(len(sub_str) -2):
        f.append(np.multiply(f[i], Tt))
    for i in reversed(range(len(sub_str) -2)):
        # choose vector from probability matrix for known second letter
        probs = f[i+1][letter2idx[sub_str[i+2]]]
        # normalize probability vector
        probs = np.true_divide(probs, np.sum(probs))
        # sample a letter from alphabet with this probability
        sub_str = sub_str[:i+1] + np.random.choice(alphabet, p=probs)[0] + sub_str[i+2:]
    return sub_str

def complete_string(t, given_string):
    # add spaces to beginning and to the end
    given_string = '.' + given_string + '.'
    # find intervals to be filled
    sub_strs = [[match.start()-1, match.end()+1] for match in re.compile('_+').finditer(given_string)]
    # fill intervals by using generator one by one
    for r in sub_strs:
        given_string = given_string[:r[0]] + f_b(t, given_string[r[0]:r[1]]) + given_string[r[1]:]
    # return only filled string
    return given_string[1:-1]

for test_string in test_strings:
    print(test_string)
    print(complete_string(transition, test_string))

th__br__n.f_x.
th.rbr.on.fax.
_u_st__n_.to_be._nsw_r__
euest.ono.to.be.ensw.re.
i__at_._a_h_n_._e_r_i_g
ieeati.oaihanl.eeer.iag
q___t.___z._____t.__.___.__.
qoeat.ee.z.eeeeat.ee.ee..ee.
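
For comparison, one possible way to sample from $p(x_{-\alpha}|x_{\alpha})$ is a backward pass that summarizes the observed letters to the right of each position, followed by forward sampling. The following is a minimal sketch and not the implementation used above: `sample_missing` and its parameters are hypothetical names, `T` is assumed to be a row-stochastic float array (like `Tt`), and the pattern is assumed to have nonzero probability under `T`. Unlike `complete_string`, it does not append an extra '.' to the end of the input.

```python
import numpy as np

def sample_missing(pattern, T, alphabet, start='.', missing='_', rng=None):
    """Sample the missing letters of `pattern` given the observed ones and x_0 = start."""
    rng = np.random.default_rng() if rng is None else rng
    idx = {c: i for i, c in enumerate(alphabet)}
    K, N = len(alphabet), len(pattern)

    # evidence[t, j] = 1 if letter j is allowed at position t, else 0
    evidence = np.ones((N, K))
    for t, ch in enumerate(pattern):
        if ch != missing:
            evidence[t] = 0.0
            evidence[t, idx[ch]] = 1.0

    # Backward pass: beta[t, i] is proportional to
    # p(observed letters after position t | x_t = i)
    beta = np.ones((N, K))
    for t in range(N - 2, -1, -1):
        beta[t] = T @ (evidence[t + 1] * beta[t + 1])
        beta[t] /= beta[t].sum()   # rescale to avoid underflow

    # Forward sampling: p(x_t = j | x_{t-1}, observations) is proportional to
    # T[x_{t-1}, j] * evidence[t, j] * beta[t, j]
    prev = idx[start]
    out = []
    for t in range(N):
        w = T[prev] * evidence[t] * beta[t]
        w /= w.sum()
        prev = rng.choice(K, p=w)
        out.append(alphabet[prev])
    return ''.join(out)
```

Because the backward messages already account for every observed letter to the right, the whole string can be processed in one pass instead of gap by gap. A call such as `sample_missing('th__br__n.f_x.', Tt, alphabet)` would be the intended usage with the table loaded earlier.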


## Most-Likely Generator

Replaces the sampling step with an argmax, which should give the most likely string; the output is deterministic because nothing is sampled. Here too, something is most likely wrong with the implementation.

In [6]:
# Same implementation as above; the only difference is that it
# takes the argmax instead of sampling from the distribution.
def viterbi(transition, sub_str):
    alphabet = list(transition.keys())
    f = [Tt[letter2idx[sub_str[0]]]]
    for i in range(len(sub_str) -2):
        f.append(np.multiply(f[i], Tt))
    for i in reversed(range(len(sub_str) -2)):
        probs = f[i+1][letter2idx[sub_str[i+2]]]
        probs = np.true_divide(probs, np.sum(probs))
        sub_str = sub_str[:i+1] + alphabet[np.argmax(probs)] + sub_str[i+2:]
    return sub_str

def complete_string_mostlikely(t, given_string):
    given_string = '.' + given_string + '.'
    sub_strs = [[match.start()-1, match.end()+1] for match in re.compile('_+').finditer(given_string)]
    for r in sub_strs:
        given_string = given_string[:r[0]] + viterbi(t, given_string[r[0]:r[1]]) + given_string[r[1]:]
    return given_string[1:-1]

for test_string in test_strings:
    print(test_string)
    print(complete_string_mostlikely(transition, test_string))

th__br__n.f_x.
thhnbr.on.f.x.
_u_st__n_.to_be._nsw_r__
euestuone.to.be.ensw.r..
i__at_._a_h_n_._e_r_i_g
ieeat..eaehhne.eeer.ieg
q___t.___z._____t.__.___.__.
qeeat.ee.z..ttttt.ee.eee.ee.
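
For comparison, the joint maximization $\arg\max_{x_{-\alpha}} p(x_{-\alpha}|x_{\alpha})$ can be computed with a Viterbi-style dynamic program in the log domain, which also yields $\log p(x_{-\alpha}^*,x_{\alpha})$ as a by-product. The following is a minimal sketch and not the implementation used above: `most_likely_fill` and its parameters are hypothetical names, `T` is assumed to be a float array with rows indexed by the previous letter (like `Tt`), and the pattern is assumed to be non-empty.

```python
import numpy as np

def most_likely_fill(pattern, T, alphabet, start='.', missing='_'):
    """Return the most likely completion of `pattern` and its joint log-probability."""
    idx = {c: i for i, c in enumerate(alphabet)}
    K, N = len(alphabet), len(pattern)
    with np.errstate(divide='ignore'):
        logT = np.log(T)                      # -inf marks impossible transitions

    # log_ev[t, j] = 0 if letter j is allowed at position t, else -inf
    log_ev = np.zeros((N, K))
    for t, ch in enumerate(pattern):
        if ch != missing:
            log_ev[t] = -np.inf
            log_ev[t, idx[ch]] = 0.0

    # delta[t, j] = best log-probability of any prefix ending in letter j at t
    delta = np.empty((N, K))
    back = np.zeros((N, K), dtype=int)
    delta[0] = logT[idx[start]] + log_ev[0]
    for t in range(1, N):
        scores = delta[t - 1][:, None] + logT   # scores[i, j]: come from i, go to j
        back[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_ev[t]

    # Trace the best path backwards from the best final letter
    best = [int(delta[-1].argmax())]
    for t in range(N - 1, 0, -1):
        best.append(int(back[t, best[-1]]))
    best.reverse()
    return ''.join(alphabet[i] for i in best), float(delta[-1].max())
```

Note that this maximizes over the whole string jointly, which is generally different from picking the most likely letter at each position independently.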


## Better Generator

Using a higher-order Markov model, in which each letter is conditioned on more than one preceding letter, should give better results for these applications.
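
As an illustration of what a higher-order model could look like, the sketch below estimates a Markov(2) table from raw text by counting letter triples. It is an assumption-laden example, not part of the original solution: `normalize_text` and `fit_markov2` are hypothetical helpers, the training text would have to come from some English corpus that this notebook does not provide, and the add-one style `smoothing` parameter is just one simple choice for handling unseen triples.

```python
import string
import numpy as np

ALPHABET = list(string.ascii_lowercase) + ['.']

def normalize_text(raw):
    """Map 'A-Z' to 'a-z' and every other character to '.', as in the exercise."""
    return ''.join(c if c in string.ascii_lowercase else '.' for c in raw.lower())

def fit_markov2(text, alphabet=ALPHABET, smoothing=1.0):
    """Estimate P[i, j, k] = p(x_t = k | x_{t-2} = i, x_{t-1} = j) from `text`.

    `smoothing` is added to every count so that unseen triples keep a small
    nonzero probability; it must be positive.
    """
    idx = {c: i for i, c in enumerate(alphabet)}
    K = len(alphabet)
    counts = np.full((K, K, K), smoothing, dtype=float)
    for a, b, c in zip(text, text[1:], text[2:]):   # all consecutive letter triples
        counts[idx[a], idx[b], idx[c]] += 1.0
    return counts / counts.sum(axis=2, keepdims=True)
```

The same sampling and most-likely-string ideas carry over by treating each pair of consecutive letters as a single state, at the cost of a larger table and more data needed to estimate it.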