### MT Lab 3: Decoder for edit phrase table

#### TranslationModel

In [1]:
from collections import namedtuple

We define a `phrase` type that holds one candidate translation: the target text and its log probability.

In [2]:
phrase = namedtuple("phrase", "text, logprob")

Write your translation model loading function. We recommend you make use of the `phrase` type.

In [3]:
def load_tm(filename, topK=20):
    # write your code here...
    tm = {}
    with open(filename) as lines:  # each line: source ||| target ||| logprob
        for line in lines:
            (f, e, logprob) = line.strip().split(" ||| ")
            tm.setdefault(tuple(f.split()), []).append(phrase(e, float(logprob)))
    for f in tm:  # prune all but top k translations
        tm[f].sort(key=lambda x: -x.logprob)
        del tm[f][topK:]
    return tm

Load the translation model (the edit phrase table from Lab 2).

In [5]:
tm = load_tm('phrase_table.dev.txt')

Check that your `tm` is correct:

In [6]:
tm['discuss', 'about']

[phrase(text='discuss about', logprob=-0.4721),
 phrase(text='discuss', logprob=-1.3376)]
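The loader above implies a plain-text phrase table with one entry per line, in the form `source ||| target ||| logprob`. As a minimal sketch of that format, the snippet below parses two in-memory lines instead of a file. The sample lines are reconstructed from the output above and are only an assumption about what `phrase_table.dev.txt` contains; `parse_tm` and `toy_tm` are hypothetical names.

```python
from collections import namedtuple

phrase = namedtuple("phrase", "text, logprob")

# Assumed sample lines, reconstructed from the output of tm['discuss', 'about'].
sample_lines = [
    "discuss about ||| discuss ||| -1.3376",
    "discuss about ||| discuss about ||| -0.4721",
]

def parse_tm(lines, topK=20):
    """Build {source word tuple: [phrase, ...]}, best log probability first."""
    table = {}
    for line in lines:
        f, e, logprob = line.strip().split(" ||| ")
        table.setdefault(tuple(f.split()), []).append(phrase(e, float(logprob)))
    for f in table:
        table[f].sort(key=lambda p: -p.logprob)  # highest logprob first
        del table[f][topK:]                      # keep only the top K candidates
    return table

toy_tm = parse_tm(sample_lines)
print(toy_tm["discuss", "about"])
```

Keys are tuples of source words, so `toy_tm["discuss", "about"]` and `toy_tm[("discuss", "about")]` refer to the same entry.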

#### Language Model

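We use KenLM as the language model for the decoder. `lm.score` returns a log10 probability and by default wraps the sentence in `<s>` and `</s>`; pass `bos=False, eos=False` to score a fragment on its own.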

In [7]:
import kenlm

In [8]:
# lm = kenlm.Model('bnc.prune.bin')
lm = kenlm.Model('/home/nlplab/jjc/1b.bin')

In [9]:
lm.score("We can discuss about it .")

-12.276196479797363

In [9]:
lm.score("We can discuss it .")

-9.702674865722656

In [16]:
lm.score("discuss about")

-13.259819030761719

In [15]:
lm.score("discuss about", bos=False, eos=False)

-7.445827007293701

In [36]:
lm.end

AttributeError: 'kenlm.Model' object has no attribute 'end'

#### Read data

In [10]:
src_sents = [tuple(line.strip().split()) for line in open('src.txt')]

You can add unknown words to `tm` up front, translating each word as itself with a default log probability,

In [1]:
# init unknown words probability for tm here...

or handle them later, during decoding (use `dict.get`).
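A sketch of the lookup-time variant follows. The helper name `lookup` and the default value `-1.0` are hypothetical choices for illustration, not values prescribed by the lab.

```python
from collections import namedtuple

phrase = namedtuple("phrase", "text, logprob")

UNK_LOGPROB = -1.0  # hypothetical default; pick a value that suits your model

def lookup(tm, src, unk_logprob=UNK_LOGPROB):
    """Return candidate translations for the source span `src` (a tuple of words).

    A single word missing from `tm` is copied as-is with the default log
    probability; longer spans missing from `tm` get no candidates.
    """
    fallback = [phrase(src[0], unk_logprob)] if len(src) == 1 else []
    return tm.get(src, fallback)

toy_tm = {("discuss", "about"): [phrase("discuss", -1.3376)]}
print(lookup(toy_tm, ("We",)))        # falls back to copying the word
print(lookup(toy_tm, ("We", "can")))  # no candidates for unknown multi-word spans
```

Restricting the fallback to single words keeps the decoder from inventing identity translations for arbitrary long spans while still guaranteeing that every sentence can be covered.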

#### Decode

In [None]:
# The following code implements a monotone decoding
# algorithm (one that doesn't permute the target phrases).
# Hence all hypotheses in stacks[i] represent translations of 
# the first i words of the input sentence. You should generalize
# this so that they can represent translations of *any* i words.
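One common way to approach the generalization mentioned in the comment is to track which source words are already translated with a coverage bitmask, and to place each hypothesis in the stack indexed by the number of covered words. The sketch below shows only that bookkeeping; `uncovered_spans` and `cover` are hypothetical helper names, and this is one possible design rather than the lab's required solution.

```python
def uncovered_spans(coverage, n):
    """Yield (i, j) for every contiguous span src[i:j] with no covered word.

    `coverage` is an int bitmask: bit k is set when source word k is translated.
    """
    for i in range(n):
        for j in range(i + 1, n + 1):
            if coverage & (1 << (j - 1)):
                break  # word j-1 is covered, so no longer span from i is free
            yield i, j

def cover(coverage, i, j):
    """Return the bitmask with words i..j-1 additionally marked as translated."""
    return coverage | (((1 << (j - i)) - 1) << i)

coverage = 0b0100                     # only word 2 of a 4-word sentence is covered
spans = list(uncovered_spans(coverage, 4))
print(spans)
new_coverage = cover(coverage, 0, 2)  # translate words 0 and 1
print(bin(new_coverage), bin(new_coverage).count("1"))  # stack index = covered words
```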

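We define a `hypothesis` type for a partial translation: its total log probability (`logprob`), its translation-model log probability (`tmprob`), the target text produced so far (`state`), and the hypothesis it extends (`predecessor`).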

In [11]:
hypothesis = namedtuple("hypothesis", "logprob, tmprob, state, predecessor")
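Because each hypothesis can point back to the one it extends, the path to any hypothesis can be recovered by following `predecessor` links. The sketch below assumes `state` holds the target text produced so far and `predecessor` holds the previous hypothesis (or `None` for the initial one); `backtrace` is a hypothetical helper and the scores are made up.

```python
from collections import namedtuple

hypothesis = namedtuple("hypothesis", "logprob, tmprob, state, predecessor")

def backtrace(h):
    """Return the chain of states from the initial hypothesis up to `h`."""
    states = []
    while h is not None:
        states.append(h.state)
        h = h.predecessor
    return states[::-1]

# Hand-built chain with made-up scores, for illustration only.
h0 = hypothesis(0.0, 0.0, "", None)
h1 = hypothesis(-1.5, -1.0, "We", h0)
h2 = hypothesis(-3.0, -2.0, "We can", h1)
print(backtrace(h2))  # ['', 'We', 'We can']
```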

Dev data

In [12]:
dev_sent = 'We can discuss about it .'
src_sent = tuple(dev_sent.split())

In [13]:
src_sent[0:1]

('We',)

Write your decode algorithm

In [15]:
PRUN = 10           # beam size: number of hypotheses expanded per stack
UNK_LOGPROB = -1.0  # assumed default log probability for words missing from tm

initial_hypothesis = hypothesis(0.0, 0.0, '', None)
stacks = [{} for _ in range(len(src_sent) + 1)]
stacks[0][''] = initial_hypothesis

for i, stack in enumerate(stacks[:-1]):
    # expand only the PRUN best hypotheses covering the first i source words
    for h in sorted(stack.values(), key=lambda h: -h.logprob)[:PRUN]:
        for j in range(i + 1, len(src_sent) + 1):
            src = src_sent[i:j]
            # an unknown single word is copied as-is with the default log probability
            default = [phrase(src[0], UNK_LOGPROB)] if j == i + 1 else []
            for p in tm.get(src, default):
                tmprob = h.tmprob + p.logprob
                state = (h.state + ' ' + p.text).strip()
                # rescore the partial output; add </s> only once the source is fully covered
                lmprob = lm.score(state, bos=True, eos=(j == len(src_sent)))
                logprob = tmprob + lmprob
                new_hypothesis = hypothesis(logprob, tmprob, state, h)
                # recombination: for identical output text, keep the better hypothesis
                if state not in stacks[j] or stacks[j][state].logprob < logprob:
                    stacks[j][state] = new_hypothesis

Select the hypothesis with the highest score in the final stack

In [17]:
winner = max(stacks[-1].values(), key=lambda h: h.logprob)

Print the decoding result

In [18]:
print(' '.join(src_sent))
print(winner.state)

We can discuss about it .
We can discuss it .


In [19]:
print("LM = {}, TM = {}, Total = {}".format(lm.score(winner.state), winner.tmprob, winner.logprob))

LM = -9.702674865722656, TM = -5.3376, Total = -15.040274865722656
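The total in the output above is the sum of the language-model and translation-model scores. A quick check of that arithmetic, using only the numbers printed above (the variable names are illustrative):

```python
lm_score = -9.702674865722656  # lm.score(winner.state), copied from the output above
tm_score = -5.3376             # winner.tmprob, copied from the output above

total = lm_score + tm_score
print(total)
```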
