Name: Berk Görgülü

I hereby declare that I observed the honour code of the university when preparing the homework.

## Pr?gr?mm?ng?H?m?w?rk 3

In this exercise we model a string of text using a Markov(1) model. For simplicity we only consider the letters 'a-z'. Capital letters 'A-Z' are mapped to the corresponding lowercase letters. All remaining characters (symbols, digits and spaces) are denoted by '.'.
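
The mapping described above can be sketched as follows. This is only an illustration and not part of the provided code; the helper name `normalize_text` is hypothetical.

```python
import string

def normalize_text(text):
    """Map text onto the 27-symbol alphabet: 'a'-'z' plus '.'."""
    out = []
    for ch in text.lower():
        # Keep the 26 lowercase ASCII letters, replace everything else by '.'
        out.append(ch if ch in string.ascii_lowercase else '.')
    return ''.join(out)

print(normalize_text("The brown fox, 42!"))
```

Every character that is not a letter, including the space, ends up as the same symbol '.', so the output has the same length as the input.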


We have a probability table $T$ where $T_{i,j} = p(x_t = j | x_{t-1} = i)$ is the transition model of letters in English text for $t=1,2 \dots N$. Assume that the initial letter in a string is always a space, denoted as $x_0 = \text{'.'}$. Such a model, where the probability table is the same at every position, is sometimes called a stationary model.
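
Under the Markov(1) assumption, the joint probability of a string given $x_0$ factorizes into a product of table entries. The sampling and scoring tasks below both rely on this:

$$
p(x_{1:N}|x_0) = \prod_{t=1}^{N} p(x_t | x_{t-1}) = \prod_{t=1}^{N} T_{x_{t-1},x_t}
$$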

1. For a given $N$, write a program to sample random strings with letters $x_1, x_2, \dots, x_N$ from $p(x_{1:N}|x_0)$
1. Now suppose you are given strings with missing letters, where each missing letter is denoted by a question mark (or underscore, as below). Implement a method that samples the missing letters conditioned on the observed ones, i.e., samples from $p(x_{-\alpha}|x_{\alpha})$ where $\alpha$ denotes the indices of the observed letters. For example, if the input is 't??.', we have $N=4$ and
$x_1 = \text{'t'}$ and $x_4 = \text{'.'}$, $\alpha=\{1,4\}$ and $-\alpha=\{2,3\}$. Your program may possibly generate the strings 'the.', 'twi.', 'tee.', etc. Hint: make sure to make use of all the data given and sample from the correct distribution. Implement the method and print the results for the test strings below.
1. Describe a method for filling in the gaps by estimating the most likely letter for each position. Hint: you need to compute
$$
x_{-\alpha}^* = \arg\max_{x_{-\alpha}} p(x_{-\alpha}|x_{\alpha})
$$
Implement the method and print the results for the following test strings along with the log-probability  $\log p(x_{-\alpha}^*,x_{\alpha})$.
1. Discuss how you can improve the model to get better estimations.
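
The log-probability of a complete string is the sum of the logs of the table entries along the string. A minimal sketch, assuming a small made-up table instead of the English one; `log_prob`, `toy_T`, `toy_alphabet` and `toy_index` are hypothetical names used only for this illustration:

```python
import math

def log_prob(s, table, index, start='.'):
    """Return log p(s | x_0 = start) under a first-order Markov chain."""
    total = 0.0
    prev = start
    for ch in s:
        # Add the log of the transition probability prev -> ch
        total += math.log(table[index[prev]][index[ch]])
        prev = ch
    return total

# Toy 3-symbol table (made-up numbers, NOT the table from transitions.csv)
toy_alphabet = ['a', 'b', '.']
toy_index = {c: i for i, c in enumerate(toy_alphabet)}
toy_T = [[0.1, 0.6, 0.3],
         [0.5, 0.1, 0.4],
         [0.7, 0.2, 0.1]]

print(log_prob('ab.', toy_T, toy_index))
```

Summing logs rather than multiplying probabilities avoids numerical underflow for long strings.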

In [1]:
test_strings = ['th__br__n.f_x.', '_u_st__n_.to_be._nsw_r__','i__at_._a_h_n_._e_r_i_g','q___t.___z._____t.__.___.__.']

Hint: The code below loads a table of transition probabilities for English text.

In [2]:
import csv
from IPython.display import display, Latex

alphabet = [chr(i+ord('a')) for i in range(26)]
alphabet.append('.')
letter2idx = {c:i for i,c in enumerate(alphabet)}

T = []
with open('transitions.csv') as csvfile:
    reader = csv.reader(csvfile, delimiter=',')
    for row in reader:
        T.append(row)

print('Example')
## p(x_t = 'u' | x_{t-1} = 'q')
display(Latex(r"$p(x_t = \text{'u'} | x_{t-1} = \text{'q'})$"))
print(T[letter2idx['q']][letter2idx['u']])
display(Latex(r"$p(x_t | x_{t-1} = \text{'a'})$"))
for c,p in zip(alphabet,T[letter2idx['a']]):
    print(c,p)

Example


$p(x_t = \text{'u'} | x_{t-1} = \text{'q'})$

0.9949749


$p(x_t | x_{t-1} = \text{'a'})$

('a', '0.0002835')
('b', '0.0228302')
('c', '0.0369041')
('d', '0.0426290')
('e', '0.0012216')
('f', '0.0075739')
('g', '0.0171385')
('h', '0.0014659')
('i', '0.0372661')
('j', '0.0002353')
('k', '0.0110124')
('l', '0.0778259')
('m', '0.0260757')
('n', '0.2145354')
('o', '0.0005459')
('p', '0.0195213')
('q', '0.0001749')
('r', '0.1104770')
('s', '0.0934290')
('t', '0.1317960')
('u', '0.0098029')
('v', '0.0306574')
('w', '0.0088799')
('x', '0.0009562')
('y', '0.0233701')
('z', '0.0018701')
('.', '0.0715219')


### 1. Sampling Random Strings

In [3]:
import numpy as np
import math

#The probabilities in some rows of the table sum to slightly more than 1 (e.g. 1.0000523)
#Normalize converts the entries to float and rescales every row to sum to exactly 1
def Normalize(K):
    normalized = []
    for row in K:
        temp = [float(v) for v in row]
        total = sum(temp)
        normalized.append([v/total for v in temp])
    return normalized

#Samples from the distribution applying Markov(1) and returns a sequence with given number of letters
#sampleString generates letters one by one and takes the last one as given in order to generate the next one

def sampleString(N):
    s='.'
    seq=''
    K=Normalize(T)
    for n in range(0,N):
        probs=K[letter2idx[s]]
        s = alphabet[np.random.choice(np.arange(0, 27), p=probs)]
        seq=seq+s
    return seq

#Draws every letter independently from the marginal p(x_n | x_0 = '.'), i.e. the '.' row of the n-th power of the transition matrix
#Only the initial '.' is taken as given, so neighbouring letters in the output are not conditioned on each other
def sampleStringFromInitial(N):
    s='.'
    seq=''
    K=Normalize(T)
    TransitionM = np.vstack(K).astype(float)
    for n in range(0,N):
        seq = seq + alphabet[np.random.choice(np.arange(0, 27), p=TransitionM[letter2idx[s]])]
        TransitionM = np.dot(TransitionM,K)
    return seq

display(Latex(r'Sample String Given $S_0$ only')) 
print sampleStringFromInitial(1000)


display(Latex(r'Sample String Given $S_{i-1}$')) 
print sampleString(1000)
    

Sample String Given $S_0$ only

m.yyutysufnh.ihrtrtdah..onshislcauid.e..rade.hb.r.fya..htteot.pnvcroalcntgeles.redmopenirt..n.ma.ae.loin.oe.dt..wreeec..lahiecabhh..te.lt..iteandrth.uu..wrg.unarn.itltnifilrso.twft..tiasosotp.ysg.ndmi.a.r..fnnl.btbn.ah.so..lbkhtgentt.imrr.euenodslwghw.s.hene.ot.bnhlhtm.eiea.otsn.enu.od.dreboo..o.n.g.hc.f..adcaonpcrm.o.etnsrrdentl.lohsai.w.sa.u.i.e.tlh.ehblsar.atossl.ivfeiros.ohlidhttevetnwe.r...v.to.oenhlria.thtwihur.ein.nhida.uil.hnrhandi.tulrridishwiy.uv.thacnet.eaiaarieetellis.h.tchs.loeoti.erti.rtnoreagmetgel.o.tttlmso.desphownwah.sm.ureae.ttsmao.e.oivifstkh.eyoanvrpbrfrsoef..c.tn.enciegb.lceai..ednulser.atca.ttlaotdau..b.in.te.eh.hemst..rhnvgmorwa..i.h.erd..leehno.etmepgdafg..r.tscrrlttdn.e.peerrti.thh.nxcmt.nnor..re.enhbc.b.iise.ileto.en.gd.phi.aeii..oi.mxaoiclas.itiao.o.cd.setulee.aelhsimaeetaat.ftcnsi.ib.drtnwhytgc.hltet.l.sssith.sp.o.dapocihcairhossnwx.ce..ao..eaathehw.dapfe.e.naare.hii.ttl.ohitstedouwg.rohtdsde.su.e.onfmfo.o.eehna.eiiweed.cdiea.aama.hkohofnif.r.k..eb.s.s..alt

Sample String Given $S_{i-1}$

um.gs.an.mpl.ghis.matin.e.hendre.urimo.ts.g.ay.are.in.tiof.sed.sat.o.d.hanor.ca.on.bet.thoothet.wong.sithapr.tame.mee.ang.h.iphe.cot.anu.offfrteto.pret.wothand.thecce.mis.haf.d.se.thancqur.trk.overesterexorenond..s.afobl.lermpsoed.oandondutofamat.y.mas.mee.d.f.hecer.cind.benchedinjoieristhastee.s.s.aud.s.int.nd.t.he.anuthten.icher.s.bunglatteff.hel..d.ancanchatheesey.parivil.be.cas.aut.s.wepl.he.anththiltoyove.e.juthes.be.ule..ire..comy.ofo.inegrei.wig.omas.hitifin.coshe.stersa.at.sendimstieelantroruche.g.bed..cermorel.f.wntan..ied.van.tr.ietesen.fofo.akndly.f.steay.he.trtson.h.apif.be.anane.ach.derzad.ount..inere..sirshe.cord.by.anly.a.id.utour.ter.t.he...thyoraroonthe.cata.are.s.tite.rcildo.herselovecck.aved.f.t.jat.ary.h.ch..iny.fom.hed.thanceveaberar.eequshithireindas.fery.foleratomingh.belouth.exve.e.ari.twofo.wilelendeas.trr.d.meeserd.an.hut.th.cteds.thincom.wnait.g.pe.an.a..al.met.aldin.s.feseaca.chthese.fise.irarmperechi.tindas.ialorrhrillld.cowonkn.pan.te.cqube..theas.ghanung.
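
The output of `sampleStringFromInitial` looks less like English than the output of `sampleString`. A likely reason is that it draws every letter independently from $p(x_t|x_0)$, which discards the dependence between neighbouring letters, whereas `sampleString` conditions each letter on the previously sampled one and therefore draws from the joint $p(x_{1:N}|x_0)$. The sketch below illustrates the difference on a made-up two-state chain; all names and numbers in it are assumptions chosen for the illustration.

```python
import numpy as np

def sample_chain(A, x0, n, rng):
    """Sample x_1..x_n jointly: each state is drawn given the previous sample."""
    states, s = [], x0
    for _ in range(n):
        s = rng.choice(len(A), p=A[s])
        states.append(s)
    return states

def sample_marginals(A, x0, n, rng):
    """Draw each x_t independently from p(x_t | x_0), a row of the t-th power of A."""
    states, P = [], A.copy()
    for _ in range(n):
        states.append(rng.choice(len(A), p=P[x0]))
        P = P.dot(A)
    return states

def switch_rate(states):
    """Fraction of adjacent positions whose states differ."""
    return float(np.mean(np.array(states[1:]) != np.array(states[:-1])))

# Toy "sticky" chain (made-up numbers): the state is kept with probability 0.9
A = np.array([[0.9, 0.1],
              [0.1, 0.9]])
rng = np.random.RandomState(0)
joint = sample_chain(A, 0, 5000, rng)
indep = sample_marginals(A, 0, 5000, rng)
print("joint: %.2f, independent: %.2f" % (switch_rate(joint), switch_rate(indep)))
```

In the joint sample the state changes about as often as the table prescribes, while the independent draws change state roughly half of the time, because the marginals quickly approach the stationary distribution.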

### 2. Fill the Missing Letters by Sampling

### 2.1. Model

Let the state $S_t$ denote the letter at position $t$ of the sequence.
Since we use a Markov(1) model, we only need to look at the closest given letters on both sides to fill in a missing letter.

Assume that $S_i$ is not given and we want to predict $S_i$.
$S_l$ is the closest given letter to the left of $S_i$ and $S_r$ is the closest given letter to the right.

Then:

If both $S_l$ and $S_r$ exist:

$P\left(S_i\ |\ S_l=l,S_r=r\right) = \frac{P\left(S_i,S_l=l,S_r=r\right)}{P\left(S_l=l,S_r=r\right)}$

$= \frac{P\left(S_l=l\right)*P\left(S_i\ |\ S_l=l\right)*P\left(S_r=r\ |\ S_i\right)}{P\left(S_r=r\ |\ S_l=l\right)*P\left(S_l=l\right)}$

$= \frac{P\left(S_i\ |\ S_l=l\right)*P\left(S_r=r\ |\ S_i\right)}{P\left(S_r=r\ |\ S_l=l\right)}$

If only $S_l$ exists:

$P\left(S_i\ |\ S_l=l\right) = \frac{P\left(S_i,S_l=l\right)}{P\left(S_l=l\right)}$

$= \frac{P\left(S_l=l\right)*P\left(S_i\ |\ S_l=l\right)}{P\left(S_l=l\right)}$

$=P\left(S_i\ |\ S_l=l\right)$

Since a Markov(1) model is used, the conditional probabilities above are obtained from powers of the transition matrix, where the power is the distance between the index of the given letter and the index of the missing letter.

In this part, after the desired probability distribution is computed, the missing letter is sampled from it. The gaps are filled from left to right, and a sampled letter is treated as given when the next gap is filled.
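
The conditional distribution derived above can be sketched with matrix powers as follows. This is a minimal illustration on a made-up 3-state table, not the code used for the results; `gap_distribution` and its parameters are hypothetical names.

```python
import numpy as np

def gap_distribution(A, left, right, d_left, d_right):
    """p(S_i | S_l = left, S_r = right) for a first-order Markov chain.

    d_left  = i - l, the distance to the known state on the left
    d_right = r - i, the distance to the known state on the right
    """
    forward = np.linalg.matrix_power(A, d_left)[left, :]      # p(S_i | S_l)
    backward = np.linalg.matrix_power(A, d_right)[:, right]   # p(S_r | S_i)
    joint = forward * backward
    # The sum over S_i equals p(S_r | S_l), the denominator in the derivation
    return joint / joint.sum()

# Toy 3-state table (made-up numbers, not the English letter table)
A = np.array([[0.1, 0.6, 0.3],
              [0.5, 0.1, 0.4],
              [0.7, 0.2, 0.1]])
dist = gap_distribution(A, left=0, right=2, d_left=1, d_right=2)
print(dist)
```

Normalizing by the sum gives the same result as dividing by the entry of the $(r-l)$-th matrix power, and it guarantees that the vector passed to a sampler sums to 1.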

### 2.2. Code & Results

In [5]:
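def findUnknownIndex(st):
    #Split the positions of st into missing ('_' or '?') and observed ones
    unknown=list()
    known = list()
    for i in range(0,len(st)):
        if st[i] == '_' or st[i] == '?':
            unknown.append(i)
        else:
            known.append(i)
    return unknown,known
def findNeighbors(k,st):
    #Return the indexes of the closest known letters to the left (sn) and to the right (bn) of position k
    #bn is None when no known letter follows; st[0] is the known initial '.', so sn exists for every k >= 1
    sn=None
    bn=None
    for i in range(k-1,-1,-1):
        if st[i] != '_' and st[i] != '?':
            sn=i
            break
    for i in range(k+1,len(st)):
        if st[i] != '_' and st[i] != '?':
            bn=i
            break
    return sn,bn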
def predictMissingLetters(st,sample=False):
    #Transition matrix as a float array: row i holds p(x_t | x_{t-1} = i)
    A = np.vstack(Normalize(T)).astype(float)
    st = list('.'+st)  #prepend the known initial state x_0 = '.'
    unknownIndexes,knownIndexes = findUnknownIndex(st)
    p = list()
    for k in unknownIndexes:
        #Gaps are filled from left to right and a filled letter counts as known afterwards,
        #so the left neighbour sn is always k-1
        sn,bn=findNeighbors(k,st)
        #p(x_k | x_sn): row of the (k-sn)-step transition matrix
        predictionProb = np.linalg.matrix_power(A,k-sn)[letter2idx[st[sn]],:]
        if bn is not None:
            #p(x_bn | x_k): column of the (bn-k)-step transition matrix
            bm = np.linalg.matrix_power(A,bn-k)[:,letter2idx[st[bn]]]
            #p(x_bn | x_sn): the normalizer from the derivation in 2.1
            nm = np.linalg.matrix_power(A,bn-sn)[letter2idx[st[sn]],letter2idx[st[bn]]]
            predictionProb = predictionProb*bm/nm
        if sample:
            st[k]=alphabet[np.random.choice(np.arange(0, 27), p=predictionProb)]
        else:
            #Greedy choice: the single most probable letter for this position
            best = int(np.argmax(predictionProb))
            st[k]=alphabet[best]
            p.append(predictionProb[best])
    #Product of the chosen conditional probabilities (empty when sampling, so the log is 0)
    total = 1
    for x in p:
        total *= x
    return "".join(st[1:]),math.log(total)

def sampleMissingLetters(st,Ntrial):
    for k in st:
        for trial in range(0,Ntrial):
            print 'Text ',st.index(k)+1," ",'Sample ',trial+1," : ",predictMissingLetters(k,sample=True)[0]
            
def predictMostLikelyMissingLetters(st):
    for k in st:
        txt,probs=predictMissingLetters(k,sample=False)
        print 'String: ',txt
        print 'Log-Probability: '
        print probs
sampleMissingLetters(test_strings,20)

Text  1   Sample  1  :  the.brsun.fix.
Text  1   Sample  2  :  the.br.an.fix.
Text  1   Sample  3  :  the.broun.fex.
Text  1   Sample  4  :  the.brtin.fex.
Text  1   Sample  5  :  thoubrean.fix.
Text  1   Sample  6  :  thy.brken.fex.
Text  1   Sample  7  :  thtabr.an.fix.
Text  1   Sample  8  :  the.brlin.fex.
Text  1   Sample  9  :  the.brown.fex.
Text  1   Sample  10  :  tho.brcon.fex.
Text  1   Sample  11  :  thtabrken.fex.
Text  1   Sample  12  :  thoubrean.fix.
Text  1   Sample  13  :  the.brdon.fex.
Text  1   Sample  14  :  the.brkin.fix.
Text  1   Sample  15  :  the.brean.fex.
Text  1   Sample  16  :  the.brron.fox.
Text  1   Sample  17  :  the.breen.fex.
Text  1   Sample  18  :  the.br.an.fix.
Text  1   Sample  19  :  the.br.in.f.x.
Text  1   Sample  20  :  the.brien.fex.
Text  2   Sample  1  :  tusstriny.to.be.onswore.
Text  2   Sample  2  :  aulstheno.tombe.inswore.
Text  2   Sample  3  :  ouasthind.to.be.insware.
Text  2   Sample  4  :  ouesthend.toube.inswarou
Text  2   Sam

### 3. Fill the Missing Letters By Most Likely Letter

In [6]:
predictMostLikelyMissingLetters(test_strings)

String:  the.br.an.fex.
Log-Probability: 
-3.07433488138
String:  oursthend.to.be.answere.
Log-Probability: 
-10.8154510862
String:  in.ath.wathend.he.r.ing
Log-Probability: 
-11.6360900332
String:  qur.t.thiz.the.at.an.the.an.
Log-Probability: 
-22.9236427638
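
The results above choose the most probable letter one position at a time, each conditioned on the letters already filled in. The joint maximizer $x_{-\alpha}^*$ asked for in the task could instead be computed with the Viterbi algorithm, a technique that is not used in the code above. The sketch below runs it on a made-up 3-state table; `viterbi_fill` and all numbers are assumptions for illustration only.

```python
import numpy as np

def viterbi_fill(A, observed, x0):
    """Jointly most likely completion of a sequence with gaps.

    observed[t] is a state index, or None where the state is missing.
    Returns (best_path, log-probability of best_path given x0).
    """
    with np.errstate(divide='ignore'):
        logA = np.log(A)                      # log(0) = -inf is acceptable here
    n_states = A.shape[0]
    score = np.full(n_states, -np.inf)
    score[x0] = 0.0                           # x_0 is known
    back = []
    for obs in observed:
        cand = score[:, None] + logA          # cand[i, j]: best path that ends with i -> j
        best_prev = cand.argmax(axis=0)
        score = cand.max(axis=0)
        if obs is not None:                   # observed positions allow one state only
            mask = np.full(n_states, -np.inf)
            mask[obs] = 0.0
            score = score + mask
        back.append(best_prev)
    # Trace the best path backwards from the best final state
    state = int(score.argmax())
    best_score = float(score[state])
    path = [state]
    for best_prev in reversed(back[1:]):
        state = int(best_prev[state])
        path.append(state)
    return path[::-1], best_score

# Toy 3-state table (made-up numbers, not the English letter table)
A = np.array([[0.1, 0.6, 0.3],
              [0.5, 0.1, 0.4],
              [0.7, 0.2, 0.1]])
path, logp = viterbi_fill(A, [None, None, 2, None], x0=2)
print(path)
```

The returned score is the log of the product of all transition probabilities along the completed sequence, which corresponds to $\log p(x_{-\alpha}^*,x_{\alpha})$ given $x_0$.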


### 4. How to Improve the Model

This model is a Markov(1) model, which means that each letter depends only on the letter immediately before it, so a gap is filled using only the closest given letters on both sides. Although this approach is simple and efficient to compute, a letter in a word is also related to the other letters of the word, not only to its direct neighbours. The ordering of the letters within a word is an important feature that should be considered. Therefore, using a higher-order Markov model would possibly give better results.
Punctuation and uppercase letters should also be included in the model. Transition probabilities for punctuation marks should be added, since they carry significant information for predicting the previous and the next letter. Uppercase letters carry some information too: modelling the first words of sentences would change the transition matrix and make it more useful.
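
As a rough illustration of the higher-order idea, a Markov(2) table can be estimated by counting letter triples in a corpus. The sketch below uses a tiny made-up corpus; `estimate_markov2` and `corpus` are hypothetical names, and a real estimate would need a large text and some smoothing for unseen contexts.

```python
from collections import defaultdict

def estimate_markov2(text):
    """Estimate p(x_t | x_{t-2}, x_{t-1}) from trigram counts in `text`."""
    counts = defaultdict(lambda: defaultdict(int))
    for a, b, c in zip(text, text[1:], text[2:]):
        counts[a + b][c] += 1
    table = {}
    for context, nxt in counts.items():
        total = float(sum(nxt.values()))
        # Relative frequency of each next letter after this two-letter context
        table[context] = {c: n / total for c, n in nxt.items()}
    return table

# Tiny made-up corpus, already mapped to the 'a'-'z' plus '.' alphabet
corpus = "the.theme.then.the."
table = estimate_markov2(corpus)
print(table['he'])
```

The table grows from $27^2$ to $27^3$ entries, so the gain in context comes at the cost of needing more data to estimate each entry reliably.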
