Name: Burak Suyunu

I hereby declare that I observed the honour code of the university when preparing the homework.

## Pr?gr?mm?ng?H?m?w?rk 3

In this exercise we model a string of text using a Markov(1) model. For simplicity we only consider letters 'a-z'. Capital letters 'A-Z' are mapped to the corresponding lowercase letters. All remaining characters (symbols, numbers, and spaces) are denoted by '.'.
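The character mapping described above can be sketched as follows. The helper name `normalize_text` is an assumption made for illustration and does not appear in the assignment.

```python
def normalize_text(text):
    """Map raw text onto the 27-symbol alphabet 'a'-'z' plus '.'."""
    out = []
    for ch in text:
        if 'A' <= ch <= 'Z':
            out.append(ch.lower())   # capitals map to their lowercase letter
        elif 'a' <= ch <= 'z':
            out.append(ch)           # lowercase letters are kept as-is
        else:
            out.append('.')          # everything else, including spaces
    return ''.join(out)

print(normalize_text("The Quick, brown fox!"))  # the.quick..brown.fox.
```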


We have a probability table $T$ where $T_{i,j} = p(x_t = j | x_{t-1} = i)$ is the transition model of letters in English text for $t=1,2, \dots, N$. Assume that the initial letter in a string is always a space, denoted as $x_0 = \text{'.'}$. Such a model, where the probability table is the same at every step, is sometimes called a stationary model.
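Written out with the symbols defined above, the Markov(1) assumption means that the probability of a whole string factorizes into a product of entries of $T$:

$$
p(x_{1:N} | x_0) = \prod_{t=1}^{N} p(x_t | x_{t-1}) = \prod_{t=1}^{N} T_{x_{t-1}, x_t}
$$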

1. For a given $N$, write a program to sample random strings with letters $x_1, x_2, \dots, x_N$ from $p(x_{1:N}|x_0)$
1. Now suppose you are given strings with missing letters, where each missing letter is denoted by a question mark (or underscore, as below). Implement a method that samples missing letters conditioned on observed ones, i.e., samples from $p(x_{-\alpha}|x_{\alpha})$ where $\alpha$ denotes indices of observed letters. For example, if the input is 't??.', we have $N=4$ and
$x_1 = \text{'t'}$ and $x_4 = \text{'.'}$, $\alpha=\{1,4\}$ and $-\alpha=\{2,3\}$. Your program may possibly generate the strings 'the.', 'twi.', 'tee.', etc. Hint: make sure to make use of all the data given and sample from the correct distribution. Implement the method and print the results for the test strings below.
1. Describe a method for filling in the gaps by estimating the most likely letter for each position. Hint: you need to compute
$$
x_{-\alpha}^* = \arg\max_{x_{-\alpha}} p(x_{-\alpha}|x_{\alpha})
$$
Implement the method and print the results for the following test strings along with the log-probability  $\log p(x_{-\alpha}^*,x_{\alpha})$.
1. Discuss how you can improve the model to get better estimations.

In [61]:
test_strings = ['th__br__n.f_x.', '_u_st__n_.to_be._nsw_r__','i__at_._a_h_n_._e_r_i_g','q___t.___z._____t.__.___.__.']

Hint: The code below loads a table of transition probabilities for English text.

In [62]:
import csv
from IPython.display import display, Latex

alphabet = [chr(i+ord('a')) for i in range(26)]
alphabet.append('.')
letter2idx = {c:i for i,c in enumerate(alphabet)}

T = []
with open('transitions.csv') as csvfile:
    reader = csv.reader(csvfile, delimiter=',')
    for row in reader:
        T.append(row)

print('Example')
## p(x_t = 'u' | x_{t-1} = 'q')
display(Latex(r"$p(x_t = \text{'u'} | x_{t-1} = \text{'q'})$"))
print(T[letter2idx['q']][letter2idx['u']])
display(Latex(r"$p(x_t | x_{t-1} = \text{'a'})$"))
for c,p in zip(alphabet,T[letter2idx['a']]):
    print(c,p)

Example

$p(x_t = \text{'u'} | x_{t-1} = \text{'q'})$

0.9949749

$p(x_t | x_{t-1} = \text{'a'})$

a 0.0002835
b 0.0228302
c 0.0369041
d 0.0426290
e 0.0012216
f 0.0075739
g 0.0171385
h 0.0014659
i 0.0372661
j 0.0002353
k 0.0110124
l 0.0778259
m 0.0260757
n 0.2145354
o 0.0005459
p 0.0195213
q 0.0001749
r 0.1104770
s 0.0934290
t 0.1317960
u 0.0098029
v 0.0306574
w 0.0088799
x 0.0009562
y 0.0233701
z 0.0018701
. 0.0715219


### Part 1

* Converted each row of the table from strings to floats and renormalized it so that it sums to exactly 1, as `np.random.choice()` requires of its `p` argument.
* Then passed the row belonging to the previous letter as the `p` parameter of `np.random.choice()` to choose the next letter.
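A vectorized variant of the two steps above might look like the sketch below. The table `T_toy`, the list `symbols`, and the seeded generator are assumptions made so that the snippet runs without `transitions.csv`. The last row is deliberately slightly off from 1 to mimic rounded table entries.

```python
import numpy as np

# Hypothetical 3-symbol table, stored as strings the way csv.reader returns rows
T_toy = [['0.5', '0.3', '0.2'],
         ['0.1', '0.6', '0.3'],
         ['0.25', '0.25', '0.5001']]  # sums to 1.0001 before normalization

Tf_toy = np.array(T_toy, dtype=float)         # strings -> floats
Tf_toy /= Tf_toy.sum(axis=1, keepdims=True)   # every row now sums to 1

symbols = ['a', 'b', '.']
rng = np.random.default_rng(0)                # seeded only for repeatability
nxt = rng.choice(symbols, p=Tf_toy[2])        # next letter after '.'
```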

In [64]:
import numpy as np
import matplotlib.pyplot as plt
import math

Tf = []

for t in T:
    tf = [float(i) for i in t]
    sm = sum(tf)
    tff = [i/sm for i in tf]
    Tf.append(tff)

In [65]:
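def fun(N):
    # Sample x_1..x_N from the chain, starting from x_0 = '.'
    text = "."
    for i in range(N):
        # draw the next letter from the row of the previous letter
        chrc = np.random.choice(alphabet, 1, p=Tf[letter2idx[text[i]]])
        text = text + chrc[0]
    print(text[1:])  # drop the initial x_0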

In [66]:
fun(100)

.the.thed.f.temenowandin.capod.ss.ors.whousedourede.and.o.momenththe.itinigrs.hen.amant.sun.of..whol


### Part 2

To find the missing letters in this Markov(1) model, we need an inference mechanism that considers both the letter before the gap and the letter after it. For a gap of $N$ missing letters $X_1, \dots, X_N$ between the observed letters $x_0$ and $x_{N+1}$, such inference can be represented as:

$p(X_1 | X_0 = x_0, X_{N+1} = x_{N+1}) \propto \sum_{X_{2:N}} p(X_{N+1} = x_{N+1} | X_N) \, p(X_N | X_{N-1}) \cdots p(X_2 | X_1) \, p(X_1 | X_0 = x_0)$

We can pull the factor $p(X_1 | X_0 = x_0)$ out of the summation. The remaining sum is the $N$-step transition probability from $X_1$ to $x_{N+1}$, which is given by the $N^{th}$ power of the transition matrix $(T^N)$. To find the probability distribution of $X_1$, we multiply elementwise the row of $T$ for $x_0$ $\big(p(X_1 | X_0 = x_0)\big)$ with the column of $T^N$ for $x_{N+1}$ (the summation part), and then normalize the result. Once $X_1$ is chosen, it becomes the new left endpoint of a gap that is one letter shorter, and we repeat this process to find the other missing letters in the gap.
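As a sanity check of this matrix-power shortcut, here is a minimal sketch that compares it with brute-force enumeration of every filling of the gap. The 3-state matrix `T_toy` and both function names are assumptions made for illustration, not part of the homework code.

```python
import itertools
import numpy as np

# Hypothetical 3-state transition matrix (each row sums to 1)
T_toy = np.array([[0.5, 0.3, 0.2],
                  [0.1, 0.6, 0.3],
                  [0.3, 0.3, 0.4]])

def gap_first_letter_dist(T, left, gap, right):
    """p(X_1 | X_0=left, X_{gap+1}=right) for a gap of `gap` missing letters."""
    reach = np.linalg.matrix_power(T, gap)[:, right]  # p(X_{gap+1}=right | X_1)
    dist = T[left] * reach                            # elementwise product
    return dist / dist.sum()

def brute_force(T, left, gap, right):
    """Same distribution, obtained by summing over every filling of the gap."""
    K = T.shape[0]
    dist = np.zeros(K)
    for filling in itertools.product(range(K), repeat=gap):
        path = (left,) + filling + (right,)
        p = np.prod([T[a, b] for a, b in zip(path[:-1], path[1:])])
        dist[filling[0]] += p                         # group by first gap letter
    return dist / dist.sum()

fast = gap_first_letter_dist(T_toy, left=0, gap=3, right=2)
slow = brute_force(T_toy, left=0, gap=3, right=2)
print(np.allclose(fast, slow))
```

The brute-force version costs $K^{N}$ evaluations for a gap of length $N$, whereas the matrix-power version only needs one column of $T^N$.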

In Part 2 we sample letters randomly from the obtained distributions, whereas in Part 3 we choose the most probable letter from each distribution. Apart from the log-probability computation, this is the only difference between Part 2 and Part 3.

In [67]:
# Precompute and store the powers of T: Tpowers[k] holds T^k

Tpowers = []
Tpowers.append(1)  # scalar placeholder for T^0; np.dot(1, Tf) yields T^1 below

# The longest test string bounds the largest gap we can encounter
maxWordLength = max(len(testString) for testString in test_strings)

for i in range(1, maxWordLength):
    p = np.dot(Tpowers[i-1], Tf)
    Tpowers.append(p)

In [77]:
# Finds the probability distribution for the given situation
def findProbDist(firstLetterInd, gap, lastLetterInd):
    probDist = np.multiply( Tf[firstLetterInd] , Tpowers[gap][: , lastLetterInd] )
    probDist = probDist / sum(probDist)
    return probDist


trials = 8

for testString in test_strings:
    print("Original: " + testString)
    
    for j in reversed(range(len(testString))):
        if (testString[j] != '_' and testString[j] != '?'):
            lastLetterInd = j
            break
    
    testString = '.' + testString
    
    for i in range(trials):
        newWord = ''
        
        # When we enter a gap, initialize inthegap = gap distance
        inthegap = 0
        
        for j in range(len(testString)):
            if testString[j] != '_' and testString[j] != '?':
                newWord = newWord + testString[j]
                
            elif j > lastLetterInd:
                chrc = np.random.choice(alphabet, 1, p=Tf[letter2idx[newWord[j-1]]])
                newWord = newWord + chrc[0]
                
            else:
                # uses inthegap variable to avoid calculating gap distance in every iteration
                if inthegap > 0:
                    probDist =  findProbDist(letter2idx[newWord[j-1]], inthegap, letter2idx[gapLastLetter])
                    
                    chrc = np.random.choice(alphabet, 1, p=probDist)
                    newWord = newWord + chrc[0]
                    inthegap = inthegap - 1
                    
                else:
                    inthegap = 0
                    for k in range(j, len(testString)):
                        if testString[k] == '_' or testString[k] == '?':
                            inthegap = inthegap + 1
                        else:
                            gapLastLetter = testString[k]
                            break
                            
                    probDist =  findProbDist(letter2idx[newWord[j-1]], inthegap, letter2idx[gapLastLetter])
                    
                    chrc = np.random.choice(alphabet, 1, p=probDist)
                    newWord = newWord + chrc[0]
                    inthegap = inthegap - 1
                    
        print("Sample" + str(i+1) + " : " + newWord[1:len(newWord)])
                    
    print()

Original: th__br__n.f_x.
Sample1 : thnubr.in.fex.
Sample2 : the.broan.f.x.
Sample3 : the.br.an.fex.
Sample4 : thitbrain.fex.
Sample5 : tho.bryon.fex.
Sample6 : the.bre.n.fex.
Sample7 : theabrean.fix.
Sample8 : the.brean.fix.

Original: _u_st__n_.to_be._nsw_r__
Sample1 : ounsthani.to.be.inswhr.s
Sample2 : ounsthany.to.be.wnswirwe
Sample3 : susstsan..to.be..nswernd
Sample4 : dumsthent.tombe.answoren
Sample5 : butsteing.to.be.answorou
Sample6 : oumstheny.to.be.answaris
Sample7 : surstlend.to.be.ensworoo
Sample8 : cursthind.to.be.onsw.rin

Original: i__at_._a_h_n_._e_r_i_g
Sample1 : ierato.lathand.hedrsing
Sample2 : in.ato.wathind.berrning
Sample3 : ircath.ha.hing.befrning
Sample4 : ishate.dathind.beereing
Sample5 : ishatw.fathine.leereing
Sample6 : i.fath.wathant.hearding
Sample7 : ineath.jathen..we.rhing
Sample8 : ie.ate.mathant.hepreing

Original: q___t.___z._____t.__.___.__.
Sample1 : quint.whaz.han.ft.id.lle.th.
Sample2 : qutat.byoz.to.but.ss.at..we.
Sample3 : qud.t.an.z.teat.t.pe.whe

### Part 3

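We apply the same procedure as in Part 2, but instead of sampling we choose the most probable letter from each distribution. This is a greedy, position-by-position choice, so it is not guaranteed to find the jointly most likely filling $x_{-\alpha}^*$. We also accumulate the product of the chosen conditional probabilities and report its logarithm, which is the log of $p(x_{-\alpha}|x_{\alpha})$ for the chosen letters rather than of the joint $p(x_{-\alpha}^*,x_{\alpha})$. Note that in the code below, missing letters after the last observed letter are still sampled randomly and do not contribute to the reported value.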
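Choosing the best letter one position at a time does not necessarily maximize the joint probability of the whole filling. A standard alternative for the joint $\arg\max$ is the Viterbi algorithm, which is not what the cell below implements. The following is a minimal sketch under stated assumptions: the 3-state matrix `T_toy`, the function `viterbi_fill`, and the use of `None` for missing positions are all hypothetical.

```python
import numpy as np

def viterbi_fill(T, start, observed):
    """Jointly most likely completion of a partially observed chain.

    T        : (K, K) row-stochastic transition matrix
    start    : state index of x_0
    observed : list with a state index for known positions, None for missing
    Returns (path, log p(path | x_0)).
    """
    K = T.shape[0]
    with np.errstate(divide='ignore'):
        logT = np.log(T)                     # log(0) becomes -inf
    N = len(observed)
    score = np.full(K, -np.inf)
    score[start] = 0.0
    back = np.zeros((N, K), dtype=int)
    for t, obs in enumerate(observed):
        cand = score[:, None] + logT         # cand[i, j]: best path into i, then i -> j
        back[t] = np.argmax(cand, axis=0)    # best predecessor of each state j
        score = np.max(cand, axis=0)
        if obs is not None:                  # clamp to the observed letter
            mask = np.full(K, -np.inf)
            mask[obs] = 0.0
            score = score + mask
    last = int(np.argmax(score))
    path = [last]
    for t in range(N - 1, 0, -1):            # follow back-pointers to position 0
        path.append(int(back[t][path[-1]]))
    path.reverse()
    return path, float(score[last])

T_toy = np.array([[0.5, 0.3, 0.2],
                  [0.1, 0.6, 0.3],
                  [0.3, 0.3, 0.4]])
path, logp = viterbi_fill(T_toy, start=2, observed=[0, None, None, 2])
print(path, logp)
```

On this toy chain the jointly best filling of the two missing positions is `[1, 1]` (path probability $0.3 \cdot 0.6 \cdot 0.3 = 0.054$ after the first observed letter), whereas the greedy per-position rule would pick `[0, 0]` ($0.5 \cdot 0.5 \cdot 0.2 = 0.05$).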

In [78]:
for testString in test_strings:
    print("Original : " + testString)
    
    for j in reversed(range(len(testString))):
        if (testString[j] != '_' and testString[j] != '?'):
            lastLetterInd = j
            break
    
    testString = '.' + testString
        
    newWord = ''
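    # Despite its name, logProb accumulates the product of the chosen
    # probabilities; the logarithm is taken only when printing
    logProb = 1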
    
    # When we enter a gap, initialize inthegap = gap distance
    inthegap = 0
    
    for j in range(len(testString)):
        if testString[j] != '_' and testString[j] != '?':
            newWord = newWord + testString[j]
            
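        elif j > lastLetterInd:
            # Trailing gap (no observed letter to the right): the letter is
            # sampled at random here and is not included in logProb
            chrc = np.random.choice(alphabet, 1, p=Tf[letter2idx[newWord[j-1]]])
            newWord = newWord + chrc[0]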
            
        else:
            # uses inthegap variable to avoid calculating gap distance in every iteration
            if inthegap > 0:
                probDist =  findProbDist(letter2idx[newWord[j-1]], inthegap, letter2idx[gapLastLetter])

                logProb = logProb * max(probDist)
                
                newWord = newWord + alphabet[np.argmax(probDist)]
                inthegap = inthegap - 1
                
            else:
                inthegap = 0
                for k in range(j, len(testString)):
                    if testString[k] == '_' or testString[k] == '?':
                        inthegap = inthegap + 1
                    else:
                        gapLastLetter = testString[k]
                        break
                        
                probDist =  findProbDist(letter2idx[newWord[j-1]], inthegap, letter2idx[gapLastLetter])
                logProb = logProb * max(probDist)

                newWord = newWord + alphabet[np.argmax(probDist)]
                inthegap = inthegap - 1
                
    print("Best Fit : " + newWord[1:len(newWord)])
    print("Log-Prob : " + str(np.log(logProb)))
    print()

Original : th__br__n.f_x.
Best Fit : the.br.an.fex.
Log-Prob : -3.07433488138

Original : _u_st__n_.to_be._nsw_r__
Best Fit : oursthend.to.be.answerch
Log-Prob : -8.32284546031

Original : i__at_._a_h_n_._e_r_i_g
Best Fit : in.ath.wathend.he.r.ing
Log-Prob : -11.6360900332

Original : q___t.___z._____t.__.___.__.
Best Fit : qur.t.thiz.the.at.an.the.an.
Log-Prob : -22.9236427638



### Part 4
As an improvement, we may use a higher-order Markov model rather than Markov(1). A higher-order model conditions on more preceding letters, so it can capture more complex structures within words and should give better results. However, the size of the transition table grows exponentially with the order: a Markov($k$) model over 27 symbols has $27^k$ contexts, each with 27 outgoing probabilities. So we should find a balance between precision and efficiency.
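To illustrate what a higher-order table would look like, here is a minimal count-based sketch that estimates $p(x_t | x_{t-k}, \dots, x_{t-1})$ from text. The function `fit_markov` and the tiny training string are assumptions made for illustration. A real table would need a large corpus, and unseen contexts would need smoothing, which this sketch omits.

```python
from collections import Counter, defaultdict

def fit_markov(text, order=2):
    """Count-based estimate of p(next letter | previous `order` letters)."""
    counts = defaultdict(Counter)
    padded = '.' * order + text              # pad so the first letters have a context
    for t in range(order, len(padded)):
        context = padded[t - order:t]
        counts[context][padded[t]] += 1
    # normalize the counts of each context into a probability distribution
    return {ctx: {c: n / sum(cnt.values()) for c, n in cnt.items()}
            for ctx, cnt in counts.items()}

model = fit_markov("the.then.the.", order=2)  # hypothetical toy corpus
print(model["he"])   # 'he' is followed by '.' twice and by 'n' once
print(27 ** 2)       # number of possible contexts for order 2
```

The dictionary only stores contexts that actually occur, which keeps the memory cost below the full $27^k \times 27$ table when the order grows.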