# 1 问题引入

采用传统的机器翻译方法时，手工编制一套双语词典及翻译规则是十分困难的。在统计机器翻译出现以后，我们采用的方法是从大量的平行语料或双语语料中获取翻译知识。基于语料的机器翻译首先需要根据一些对齐规则进行句子级的对齐(sentence alignment)。下面将对翻译模型和句子对齐问题进行更深入的讨论。

# 2 计算模型

## 2.1 噪声信道模型
假设源语言句子是由某个目标语言的句子经过噪声信道传播得到，使用贝叶斯方法就可以找到最可能产生该源语言句子的目标语言句子，及将源语言句子$f=f_1,f_2,\cdots,f_m$翻译到目标语言句子$e=e_1,e_2,\cdots,e_l$使得$P(E|F)$最大化：
$$
\begin{align*}
\hat{e}&=argmax_{e\in English} P(e|f) \\
        &=argmax_{e\in English} \frac{P(f|e)P(e)}{P(f)} \\
        &=argmax_{e\in English} P(f|e)P(e) \\
\end{align*}
$$
噪声信道模型的主要由3部分构成：翻译模型，语言模型和解码器。

## 2.2 语言模型
可以采用n-gram语言模型计算$P(e)$，也可以采用较复杂的PCFG语言模型来捕获长距离相依特性来计算$P(e)$。

## 2.3 翻译模型-IBM Model 1
从英语句子$e$生成一个外文句子$f$:

1.依据概率$\frac{1}{(l+1)^m}$挑选一种对齐方式。

2.依据下列概率选择外文句子：
$$
p(f|a,e,m)=\prod_{j=1}^m t(f_j|e_{a_j})
$$
则：
$$
p(f,a|e,m)=\frac{1}{(l+1)^m}\prod_{j=1}^m t(f_j|e_{a_j})
$$
最后得到：
$$
p(f|e,m)=\sum_{a\in A}p(f,a|e,m)
$$

3.对于给定的$(f,e)$对，可以计算某种对齐$a$的概率：
$$
p(a|f,e,m)=\frac{p(a|e,m)p(f|a,e,m)}{\sum_{a\in A}p(f,a|e,m)}
$$

4.进而，最可能的对齐方式为：
$$
a^*=argmax_a p(a|f,e,m)
$$

## 2.4 翻译模型-IBM Model 2
第二种模型与第一种模型的主要区别就是引入了对齐时扭曲系数：$q(i|j,l,m)$，表示给定$e$和$f$的长度分别为$l$和$m$时，第$j$个外文次和第$i$个英文词对齐的概率。下面是从英语句子$e$生成一个外文句子$f$的过程:

1.依据如下概率选择一种对齐方式：$a=\{a_1,a_2,\cdots,a_m\},\prod_{j=1}^m q(a_j|j,l,m)$

2.依据如下概率选择一个外文句子$f$：
$$
p(f|a,e,m)=\prod_{j=1}^m t(f_j|e_{a_j})
$$
进而得到：
$$
p(f,a|e,m)=\prod_{j=1}^m q(a_j|j,l,m)t(f_j|e_{a_j})
$$
最后得到：
$$
p(f|e,m)=\sum_{a\in A}p(f,a|e,m)
$$

3.如果已经得到参数$q$和$t$，则对于每个句对$e_1,e_2,\cdots,e_l,f_1,f_2,\cdots,f_m$，其最优对齐$a_j$为：
$$
a_j=argmax_{a\in\{0,\cdots,l\}} q(a|k,l,m)t(f_j|e_a)
$$

但是，如果我们的训练语料仅包含英文句子和外文句子，而不包含对齐方式，我们需要使用EM算法来估计参数。

1.通过迭代计算模型参数$q$和$t$。从一个初始值出发，每次迭代时根据训练数据和当时的$q,t$计算counts，再依据当前的counts重新估计$q,t$。

2.每次迭代时，依据下式计算$\delta(k,i,j)$:
$$
\delta(k,i,j)=\frac{q(j|i,l_k,m_k)t(f_i^{(k)}|e_j^{(k)})}{\sum_{j=0}^{l_k}q(j|i,l_k,m_k)t(f_i^{(k)}|e_j^{(k)})}
$$

# 3 编程实现

In [1]:
# coding=utf-8-sig
import codecs
import string
import os
import numpy as np
from nltk import word_tokenize
from operator import itemgetter
from collections import defaultdict

In [13]:
class IBMModel:
    def __init__(self, opt=2, num=100000):
        self.data,self.en,self.cn = self.load_data(num)
        self.t = self.init_t()
        self.q= self.init_q()
        self.conditional_dict = defaultdict(list)
        self.opt = opt

    # load txt data，map from en sentences to cn sentences
    def load_data(self, num=100000):
        print("Start loading data...")
        en = []
        cn = []
        file_list = []
        file_list.append(codecs.open('en.txt', "r", "utf-8"))
        file_list.append(codecs.open('cn.txt', "r", "utf-8"))
        i = 0
        data = {}
        while i < num:
            sentence_en = word_tokenize("NULL " + file_list[0].readline().strip("\n").lower())
            sentence_cn = word_tokenize("NULL " + file_list[1].readline().strip("\n").lower())
            sentence_en = [s for s in sentence_en if not s in string.punctuation]
            sentence_cn = [s for s in sentence_cn if not s in string.punctuation]
            if i == 0:
                lyh = sentence_en[1]
                ryh = sentence_en[23]
            sentence_en = [s for s in sentence_en if not s == lyh]
            sentence_cn = [s for s in sentence_cn if not s == lyh]
            sentence_en = [s for s in sentence_en if not s == ryh]
            sentence_cn = [s for s in sentence_cn if not s == ryh]
            sentence_en = tuple(sentence_en)
            sentence_cn = tuple(sentence_cn)
            data[sentence_en] = sentence_cn
            en.append(sentence_en)
            cn.append(sentence_cn)
            i += 1
        print("Finish loading data!")
        return data, en, cn

    # init t
    def init_t(self):
        num_of_words = len(set(f_word for (english_sent, foreign_sent) in self.data.items() for f_word in foreign_sent))
        t = defaultdict(lambda: float(1 / num_of_words))
        return t
    # init distortion parameter
    def init_q(self):
        q = defaultdict(lambda: float(1/100))
        return q

    def fit(self, max_iter=5):
        print("Start fitting...")
        for n in range(max_iter):
            count_e_given_f = defaultdict(float)
            count_i_give_j = defaultdict(float)
            qtotal = defaultdict(float)
            total = defaultdict(float)
            sentence_total = defaultdict(float)
            for o in range(len(self.en)):
                english_sent=self.en[o]
                foreign_sent=self.cn[o]
                l1 = len(english_sent)
                l2 = len(foreign_sent)
                for i in range(len(foreign_sent)):
                    for j in range(len(english_sent)):
                        if self.opt == 1:
                            self.q[(j,i,l1,l2)] = 1/(l1+1)**l2
                        sentence_total[(i,l1,l2)] += self.q[(j,i,l1,l2)]*self.t[(foreign_sent[i],english_sent[j])]
                for i in range(len(foreign_sent)):
                    for j in range(len(english_sent)):
                        delta = self.t[(j,i,l1,l2)]*self.t[(foreign_sent[i],english_sent[j])]/sentence_total[(i,l1,l2)]
                        count_e_given_f[(foreign_sent[i],english_sent[j])] += delta
                        total[(english_sent[j])] += delta
                        count_i_give_j[(i,j,l1,l2)] += delta
                        total[(i,l1,l2)] += delta
                for i in range(len(foreign_sent)):
                    for j in range(len(english_sent)):
                        self.t[(foreign_sent[i],english_sent[j])] = count_e_given_f[(foreign_sent[i],english_sent[j])]/total[(english_sent[j])]
                        if self.opt == 2:
                            self.q[(j,i,l1,l2)] = count_i_give_j[(i,j,l1,l2)]/total[(i,l1,l2)]
            print("iter = " + str(n))
        print("Finish fitting!")

    # find the best alignments
    def get_alignments(self):
        a = []
        for t in range(len(self.en)):
            english_sent = self.en[t]
            foreign_sent = self.cn[t]
            a.append([])
            for k in range(len(foreign_sent)):
                a[t].append(0)
            for i in range(len(foreign_sent)):
                p = 0
                for j in range(len(english_sent)):
                    if self.t[(foreign_sent[i], english_sent[j])]*self.q[(j,i,len(english_sent),len(foreign_sent))] > p:
                        p = self.t[(foreign_sent[i], english_sent[j])]*self.q[(j,i,len(english_sent),len(foreign_sent))]
                        a[t][i] = j
        return a

    def print_t(self, max_iter=30):
        iterations = 0
        for ((f_word, e_word), value) in sorted(self.t.items(), key=itemgetter(1), reverse=True):
            if iterations < max_iter:
                print("{}, {}".format("t(%s|%s)" % (f_word, e_word), value))
            else:
                break
            iterations += 1
        i= 0
#         for m in self.q.keys():
#             if i<600:
#                 print(m,self.q[m])
#                 i+=1
#             else:
#                 break

In [17]:
ibm_1 = IBMModel(1)
ibm_1.fit()

Start loading data...
Finish loading data!
Start fitting...
iter = 0
iter = 1
iter = 2
iter = 3
iter = 4
Finish fitting!


In [18]:
ibm_1.print_t()
a = ibm_1.get_alignments()
print(a[:50])

t(--|happy), 0.9999999996852461
t(倒杯水|again), 0.9999999992319613
t(一月|often), 0.9999999973003395
t(销售部|please), 0.9999999876060397
t(问|alone), 0.9999999862277923
t(鞋子|heavy), 0.9999998843037965
t(灵便|shoulder), 0.9999997839828911
t(根由|source), 0.9999996961339709
t(有效|off), 0.9999989015575106
t(的|unfavorable), 0.9999988992183628
t(的|recurrence), 0.9999977666862465
t(海军|admiral), 0.9999976739504861
t(车门|all), 0.999997018636993
t(的|up-to-date), 0.9999969996832432
t(的|jumbled), 0.9999968611012439
t(斑马|zebra), 0.9999967178870982
t(利|small), 0.999996421746315
t(的|contingent), 0.9999963125579988
t(她说|always), 0.9999955120051789
t(，|rallied), 0.9999945217732751
t(，|begged), 0.9999942232334081
t(的|dwelt), 0.9999937563725585
t(的|rash), 0.9999928766103096
t(犯错误|all), 0.9999921090346664
t(的|unseen), 0.9999905415153532
t(的|copyright), 0.9999902584718418
t(的|sadness), 0.9999902425725843
t(明白地|door), 0.9999899167350783
t(的|instilled), 0.9999897176967448
t(摩尔多瓦共和国|moldova), 0.9999891268364945
[[18, 9, 

In [14]:
ibm_2 = IBMModel(2)
ibm_2.fit()

Start loading data...
Finish loading data!
Start fitting...
iter = 0
iter = 1
iter = 2
iter = 3
iter = 4
Finish fitting!


In [16]:
ibm_2.print_t()
a = ibm_2.get_alignments()
print(a[:50])

t(摩尔多瓦共和国|moldova), 0.9999992948607731
t(布隆迪共和国|burundi), 0.9999989149897981
t(爱沙尼亚共和国|estonia), 0.9999938253291653
t(波特率|baud), 0.9999799966783909
t(斑马|zebra), 0.9999798580057734
t(几内亚比绍共和国|guinea-bissau), 0.9999765154982379
t(马尔代夫共和国|maldives), 0.9999624478024018
t(斯|jones), 0.9999511356807737
t(基因图谱|genetic), 0.9999430221353006
t(巴布亚新几内亚|papua), 0.9999400384390129
t(巴布亚新几内亚|gunea), 0.9999400384390129
t(急|written), 0.999927675688751
t(毛里求斯共和国|mauritius), 0.9999206859579925
t(黎巴嫩共和国|lebanese), 0.9999193731255746
t(瓦努阿图共和国|vanuatu), 0.9999140007667047
t(卢旺达共和国|rwandese), 0.9998857631853852
t(忽然|abruptly), 0.9998837040663799
t(就在|telephone), 0.999879844550907
t(太远|furthermore), 0.9998769598829513
t(出门|rolled), 0.9998616861289754
t(赞比亚共和国|zambia), 0.9998575905864929
t(成吉思汗|genghis), 0.9998485568210402
t(斯洛伐克共和国|slovak), 0.9998437234642802
t(海军|naval), 0.9998311515032571
t(计算机病毒|virus), 0.9998226789267802
t(越南社会主义共和国|viet), 0.9998064210615605
t(越南社会主义共和国|nam), 0.9998064210615605
t(冷|far),

# 4 模型评估