In [2]:
import time
print(f"Last updated on {time.asctime(time.gmtime())} UTC")

Last updated on Wed Dec 25 08:45:12 2019 UTC


### Purpose

This notebook walks through the design of a Back-off n-gram Language Model (BNLM) enhanced with a Neural Network Language Model (NNLM), as described in the paper "Neural Network Language Model for Chinese Pinyin Input Method Engine" by Chen et al. The task is, given a sequence of syllables, to predict the most likely next syllable, a task also known as candidate sentence generation. The model is to be integrated into HKIME, an intelligent input method for Cantonese. The notebook has three sections, each building on the previous one.

### Breakdown

- Section 1: Basic n-grams using particle filtering
- Section 2: Back-off n-grams language model with interpolated Kneser-Ney smoothing
- Section 3: BNLM (from section 2) with probabilities calculated with NNLM

In [11]:
import random

# Jyutping Corpus Processing

In [34]:
#TODO: Add corpus processing
#TODO: Find better corpora
#TODO: Look into web scraping to generate corpora ourselves

FILENAME = "sources/training_data.txt"

with open(FILENAME, "r", encoding="gb18030", errors="ignore") as f:
    content = f.read()
# Split on the full stop and re-attach it to each piece.
jyutping_corpus = [e + "。" for e in content.split("。")]
print(jyutping_corpus)

['綜合報道-----------------------------【78頁電子書●你不能不知的毛澤東】老毛秘史和他的女人們 熟讀紅色抗爭策略一按即睇︰  https://hk.adai.ly/e/fJX5EvX1v2 -----------------------------【匿名港警爆大鑊】韓國KBS時事節目《時事直擊：香港  被自殺》一按即睇但即使欲知後事如何，也沒有下回分解。', '不知是否事件傳播「不良影響」，已被微博刪除，但有人在微博上傳要付費的下載連結，有人看完表示「很棒」。', '網民熱烈討論「姐夫門」，留言「刷新了我的三觀」、「城裏人真是會玩啊」。', '除了原味畫面外，網上還流傳新郎揭發新娘真面目的一幕，影片中兩人在嘉賓注目中登場，走上舞台，舞台後的大螢幕播放2人的成長片段，即將播完之際，大螢幕上突然播放一男一女在床上纏綿的閉路電視角度影片。', '新郎馬上怒吼：「你以為我不知道嗎！」新娘憤而將花球擲向新郎應對，台下一陣騷動。', '（新增內容）常言道人生如戲，戲如人生，但像電影情節般juicy的事件發生在日常生活中，難免會讓人震驚。', '微博昨日（27日）就出現一個「#新郎婚禮上扒皮新娘#」、「#姐夫門#」的熱搜題目，內地一個新郎在婚宴上，播放新娘在房間與情夫纏綿的影片，而這名情夫，正是新郎的姐夫。', '新郎還將涉事的高清無碼影片原汁原味放上內地成人網站。', '蘋果新聞\n\n-----------------------屬於香港人的移民攻略立即登入【全球樓行】 Dream House零距離撰文：陳家雄採訪：陳家雄、張嘉鎣、朱得志、陳新政影片編撰：張嘉鎣攝影：S.dragon、Kin、Kenji、Gary、細釗、黃偉傑、鄭嘉峻、陳錦源剪接、後期、編導：何文超編審：楊智佳3月份樓市呈現小陽春，二手價回升，中央已公佈《粵港澳大灣區發展規劃綱要》，香港鼓吹大灣區買樓的宣傳排 降购！Ｒ恢庇泻Ｍ庵脴I念頭的Chatster都受到這氛圍感染，蠢蠢欲動想睇大灣區樓盤。', '既然Erica如此硬銷大灣區，雙方約定直擊大灣區樓盤，鐵定6月13日中山相見。', '究竟Chatster租樓意志堅定，還是經紀Erica賣樓把口勁？Erica在香港有一間村屋，大部份資產投放在內地，「我會投資停車場，一買就100個、200個車位。', '因為其實我哋而家係啱啱好
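The output above still contains separator runs, URLs, and stray newlines. As a rough sketch of the corpus processing listed in the TODOs, the hypothetical helpers below (`clean_sentence` and `split_sentences` are illustrative names, not part of the notebook) strip those artifacts and skip empty pieces. The exact rules are assumptions and would need tuning against the real corpus.

```python
import re


def clean_sentence(sentence):
    """Hypothetical cleaning step: drop URLs and dash runs, collapse whitespace."""
    sentence = re.sub(r"https?://\S+", "", sentence)  # remove URLs (up to the next whitespace)
    sentence = re.sub(r"-{3,}", "", sentence)         # remove runs of three or more dashes
    sentence = re.sub(r"\s+", " ", sentence)          # collapse spaces and newlines
    return sentence.strip()


def split_sentences(content, terminator="。"):
    """Split on the terminator, re-attach it, and skip pieces that are empty after cleaning."""
    sentences = []
    for piece in content.split(terminator):
        cleaned = clean_sentence(piece)
        if cleaned:
            sentences.append(cleaned + terminator)
    return sentences


# Made-up sample text, not taken from the training file.
sample = "綜合報道-----一按即睇  https://example.com/abc -----欲知後事如何。\n\n網民熱烈討論。"
print(split_sentences(sample))
```

Skipping empty pieces also avoids the lone "。" entry that a plain `split` produces when the text ends with the terminator.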

### Set the n-gram context length (n-1 characters for an n-gram model)

In [35]:
CHARACTER_COUNT = 2

# Section 1: Basic n-grams prediction model

Post-processing of the Cantonese corpus gives us a list of strings, where each string may be a phrase, a sentence, or a paragraph. For conciseness, we call all of these sentences. In this section, we split each sentence into n-grams and store the possible next characters for each n-gram in a Python dictionary. The naive prediction algorithm generates candidate sentences by randomly picking from the possible next characters for a given n-gram.

### Generating n-grams dictionary

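This generates a dictionary where each key is an n-gram and each value is a list of the characters observed after it. Repeated characters are kept, so how often a character appears in the list reflects how often it followed the n-gram.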

In [36]:
#returns dictionary for prediction
def generate_n_grams_dict(processed_corpus):
    result = dict()
    for sentence in processed_corpus:
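        # i is the start index of the slice.
        # NOTE: range(len(sentence) - CHARACTER_COUNT) already stops before the slice
        # with no next char; the extra -1 also skips the last valid pair of each sentence.
        for i in range(len(sentence) - CHARACTER_COUNT - 1):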
            grams = sentence[i:i+CHARACTER_COUNT]
            next_char = sentence[i+CHARACTER_COUNT]
            if grams in result:
                result[grams].append(next_char)
            else:
                # as long as there is an n-gram key in the dict, there would be at least one next char
                result[grams] = [next_char] 
    return result
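To make the shape of this dictionary concrete, here is a small sketch on a toy corpus. `build_next_char_lists` is a hypothetical, compact stand-in for the function above, not a replacement: it covers every position that has a following character, whereas the cell above stops one position earlier.

```python
from collections import defaultdict


def build_next_char_lists(sentences, context_len=2):
    """Map each context of `context_len` characters to the list of characters seen after it."""
    table = defaultdict(list)
    for sentence in sentences:
        # The last usable start index is the one whose slice still has a next character.
        for i in range(len(sentence) - context_len):
            table[sentence[i:i + context_len]].append(sentence[i + context_len])
    return dict(table)


toy_corpus = ["abcabd。"]  # made-up data for illustration
toy_table = build_next_char_lists(toy_corpus)
for context, next_chars in toy_table.items():
    print(repr(context), "->", next_chars)
```

The context "ab" occurs twice in the toy sentence, so its list holds both "c" and "d".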

### TODO: visualization of n-grams dict

### Prediction Model

If the end of the sentence matches an n-gram in the dictionary, this naive prediction model randomly selects the next character from that n-gram's list of observed next characters; otherwise it returns None.

In [37]:
# Returns a next character given an n_grams_dict and a sentence, or None if the
# trailing n-gram is not in the dictionary.
def predict_next_char(n_grams_dict, sentence):
    potentials = n_grams_dict.get(sentence[-CHARACTER_COUNT:])
    return random.choice(potentials) if potentials is not None else None
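Because the lists keep duplicates, a uniform `random.choice` over a list behaves like sampling in proportion to observed frequency. The short sketch below illustrates this with a made-up list in which one character appears three times out of four; the list and the seed are assumptions chosen for illustration.

```python
import random
from collections import Counter

# Hypothetical next-character list for a single context: "一" was seen 3 times, "播" once.
next_chars = ["一", "一", "一", "播"]

rng = random.Random(0)  # fixed seed so repeated runs draw the same sequence
draws = Counter(rng.choice(next_chars) for _ in range(10_000))
share = draws["一"] / sum(draws.values())
print(draws)
print(f"share of '一': {share:.3f}")  # expected to be close to 0.75
```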

### Testing

Here we test the naive n-gram prediction model, which serves as a baseline for comparison with the more sophisticated language models. We run two tests: the first generates a sentence of up to 200 characters, and the second measures how many next characters the model predicts correctly on the corpus (currently the same data used for training).

TODO: Add a validation dataset. The current corpus is too small to be used both for training and validation.

#### Test 1

In [38]:
def testing(sentence):
    n_grams_dict = generate_n_grams_dict(jyutping_corpus)
    tmp = sentence
    # Generate up to 200 more characters; stop early if the trailing n-gram is not in the dict.
    for _ in range(200):
        res = predict_next_char(n_grams_dict, tmp)
        if res is None:
            break
        tmp = tmp + res
    return tmp

# Because of the stochastic nature of particle filtering, the result differs between runs
print("Trial 1: ")
print(testing("最後"))
print("Trial 2: ")
print(testing("最後"))

Trial 1: 
最後一集係節目，2000萬元都買到樓
Trial 2: 
最後播嘅一隻子彈用「小型手槍，史密夫威信喺18天的13年自己根本就負擔不起公屋，其後樓價」轉移民意視線，等佢哋畀65歲或以上老人家同合資格殘疾人士畀2蚊港紙就可以用較高嘅壓力，如果地方企業違約金額在過去10.44麥林特製手槍用


#### Test 2
**WARNING**: this test scores the model on the corpus it was trained on, not on a validation dataset.

In [39]:
n_grams_dict = generate_n_grams_dict(jyutping_corpus)
count = 0
correct = 0
for sentence in jyutping_corpus:
    for i in range(len(sentence) - CHARACTER_COUNT - 1):
        if predict_next_char(n_grams_dict, sentence[:i+CHARACTER_COUNT]) == sentence[i+CHARACTER_COUNT]:
            correct += 1
        count += 1

print(f"Total of {count} predictions made")
print(f"{correct} predictions correct")
print(f"Prediction accuracy: {(correct/(count)) * 100}%")

Total of 16907 predictions made
12467 predictions correct
Prediction accuracy: 73.73868811734783%
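The figure above is measured on the training corpus. Once a larger corpus is available, a held-out evaluation could look roughly like the sketch below. `split_corpus` and `next_char_accuracy` are hypothetical helpers, and the 10% held-out fraction and fixed seed are arbitrary assumptions.

```python
import random


def split_corpus(sentences, held_out_fraction=0.1, seed=0):
    """Shuffle a copy of the corpus and split it into (train, held_out)."""
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)
    n_held_out = max(1, int(len(shuffled) * held_out_fraction))
    return shuffled[n_held_out:], shuffled[:n_held_out]


def next_char_accuracy(predict, sentences, context_len=2):
    """Fraction of positions where predict(prefix) equals the actual next character."""
    total = correct = 0
    for sentence in sentences:
        for i in range(len(sentence) - context_len):
            total += 1
            if predict(sentence[:i + context_len]) == sentence[i + context_len]:
                correct += 1
    return correct / total if total else 0.0


# Made-up data and a trivial predictor, only to show how the pieces fit together.
toy_sentences = [f"s{i}。" for i in range(20)]
train, held_out = split_corpus(toy_sentences)
always_stop = lambda prefix: "。"
print(len(train), len(held_out))
print(next_char_accuracy(always_stop, held_out))
```

In practice the dictionary would be built from `train` only, and `predict` would wrap the prediction function with that dictionary.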


# Section 2: Back-off n-grams language model with interpolated Kneser-Ney smoothing
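For reference, the interpolated Kneser-Ney estimate named in this section's title can be written as follows for the bigram case. This is the common textbook form with a single discount $d$ (with $0 < d < 1$), given here as an assumption about what the planned smoothing would compute; the cells in this section rank candidates by raw counts.

$$
P_{KN}(w_i \mid w_{i-1}) = \frac{\max\bigl(c(w_{i-1} w_i) - d,\ 0\bigr)}{c(w_{i-1})} + \lambda(w_{i-1})\, P_{cont}(w_i)
$$

$$
\lambda(w_{i-1}) = \frac{d}{c(w_{i-1})}\, \bigl|\{ w : c(w_{i-1} w) > 0 \}\bigr|,
\qquad
P_{cont}(w_i) = \frac{\bigl|\{ w' : c(w' w_i) > 0 \}\bigr|}{\bigl|\{ (w', w'') : c(w' w'') > 0 \}\bigr|}
$$

Here $c(w_{i-1} w_i)$ is the bigram count and $c(w_{i-1}) = \sum_{w} c(w_{i-1} w)$. The weight $\lambda(w_{i-1})$ redistributes exactly the probability mass removed by the discount, and $P_{cont}(w_i)$ measures in how many distinct contexts $w_i$ appears rather than how often it occurs overall.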

### Generating n-grams dictionary

To use n-grams with a back-off model, we need to store more information in the dictionary. In particular, we store not only the n-grams but every shorter context down to the unigram, and for each context we keep a count for every next character instead of a flat list. The back-off model first looks up the longest context and, if it is missing, backs off to n-1, n-2, and so on.

In [40]:
def generate_backoff_n_grams_dict(processed_corpus):
    result = dict()
    for cc in range(1, CHARACTER_COUNT+1):
        for sentence in processed_corpus:
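            # i is the start index of the slice.
            # NOTE: range(len(sentence) - cc) already stops before the slice with no
            # next char; the extra -1 also skips the last valid pair of each sentence.
            for i in range(len(sentence) - cc - 1):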
                grams = sentence[i:i+cc]
                next_char = sentence[i+cc]
                if grams in result:
                    if next_char in result[grams]:
                        result[grams][next_char] += 1
                    else:
                        result[grams][next_char] = 1
                else:
                    #as long as there is an n-gram key in the dict, there would be at least one next char
                    result[grams] = {}
                    result[grams][next_char] = 1
                    
    return result
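The nested structure is easier to see on a toy corpus. The sketch below uses a hypothetical stand-in, `build_backoff_counts`, built on `collections.Counter`; unlike the cell above, it covers every position that has a following character.

```python
from collections import Counter, defaultdict


def build_backoff_counts(sentences, max_context=2):
    """Map every context of length 1..max_context to a Counter of next characters."""
    counts = defaultdict(Counter)
    for context_len in range(1, max_context + 1):
        for sentence in sentences:
            for i in range(len(sentence) - context_len):
                context = sentence[i:i + context_len]
                counts[context][sentence[i + context_len]] += 1
    return counts


toy_counts = build_backoff_counts(["abcabd。"])  # made-up data for illustration
print(dict(toy_counts["a"]))   # unigram context
print(dict(toy_counts["ab"]))  # bigram context
```

In the toy sentence, "a" is always followed by "b", while "ab" is followed once by "c" and once by "d".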
    

In [41]:
generate_backoff_n_grams_dict(jyutping_corpus)

{'綜': {'合': 1},
 '合': {'報': 1,
  '人': 1,
  '社': 1,
  '作': 1,
  '路': 2,
  '法': 1,
  '照': 2,
  '資': 4,
  '，': 1,
  '乎': 1},
 '報': {'道': 3, '，': 1, '訪': 1, '章': 1, '》': 2, '程': 1, '平': 2},
 '道': {'-': 1,
  '嗎': 1,
  '人': 1,
  '其': 1,
  '料': 1,
  '衛': 2,
  '、': 4,
  '要': 1,
  '可': 1,
  '做': 1,
  '延': 1,
  '行': 2,
  '，': 3,
  '刊': 1,
  '歉': 1,
  '被': 1},
 '-': {'-': 526,
  '【': 12,
  '屬': 1,
  '$': 1,
  '無': 2,
  '時': 1,
  's': 1,
  '記': 1,
  '《': 1,
  'D': 1,
  ' ': 2,
  '可': 1,
  '1': 5,
  '2': 5,
  '9': 3,
  '3': 3,
  '4': 3,
  '5': 3,
  '6': 3,
  '7': 4,
  '8': 3,
  'w': 1,
  'e': 1,
  'a': 1,
  'd': 1},
 '【': {'7': 2, '匿': 8, '全': 1, '樓': 2, '抗': 4, '撐': 3, '年': 1},
 '7': {'8': 2,
  '日': 3,
  '年': 6,
  '5': 2,
  '歲': 1,
  '3': 1,
  '億': 2,
  '.': 2,
  '9': 1,
  '號': 6,
  '同': 1,
  '係': 1,
  '嘅': 1,
  '之': 1,
  ' ': 4,
  '0': 3,
  ']': 2,
  '7': 1,
  '月': 2},
 '8': {'頁': 2,
  '年': 9,
  '0': 3,
  '區': 1,
  '成': 1,
  ',': 1,
  '℃': 1,
  '1': 1,
  '億': 1,
  '4': 2,
  '天': 1,
  '號': 2,
  '月

### Prediction Model

The prediction model first takes the trailing n-gram from the sentence and looks it up in the dictionary. If it is not found, the model backs off to progressively shorter contexts until it reaches the unigram. It then returns the next character with the highest count for the first context it finds.

In [42]:
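def predict_next_char_backoff(n_grams_dict, sentence):
    potentials = None  # stays None for an empty sentence or when no context matches
    # Try the longest context first, then back off to shorter ones.
    for cc in range(min(len(sentence), CHARACTER_COUNT), 0, -1):
        potentials = n_grams_dict.get(sentence[-cc:])
        if potentials is not None:
            break
    # Return the next character with the highest count.
    return max(potentials, key=potentials.get) if potentials is not None else None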
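The back-off lookup can be traced on a tiny hand-written table. `most_likely_next` is a hypothetical, self-contained version of the same idea, and the counts are made up for illustration.

```python
def most_likely_next(counts, text, max_context=2):
    """Try the longest context first, then shorter ones; return the most frequent next char."""
    for context_len in range(min(len(text), max_context), 0, -1):
        candidates = counts.get(text[-context_len:])
        if candidates:
            return max(candidates, key=candidates.get)
    return None  # no context of any length was found


# Made-up counts: one unigram context and one bigram context.
toy_counts = {
    "b": {"c": 3, "d": 1},
    "ab": {"d": 2},
}

print(most_likely_next(toy_counts, "ab"))  # "ab" is known, so its counts are used: d
print(most_likely_next(toy_counts, "xb"))  # "xb" is unknown, backs off to "b": c
print(most_likely_next(toy_counts, "xy"))  # neither "xy" nor "y" is known: None
```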

### Testing

These are the same tests as in Section 1. We keep them unchanged to see whether there is an improvement.

#### Test 1

In [43]:
def testing(sentence):
    n_grams_dict = generate_backoff_n_grams_dict(jyutping_corpus)
    tmp = sentence
    # Generate up to 200 more characters; stop early if no context is found in the dict.
    for _ in range(200):
        res = predict_next_char_backoff(n_grams_dict, tmp)
        if res is None:
            break
        tmp = tmp + res
    return tmp

print("Trial 1: ")
print(testing("最後"))  # TODO: Fix circularity (if it is a problem?)

Trial 1: 
最後播嘅一集係節目《時事節目《時事節目《時事節目《時事節目《時事節目《時事節目《時事節目《時事節目《時事節目《時事節目《時事節目《時事節目《時事節目《時事節目《時事節目《時事節目《時事節目《時事節目《時事節目《時事節目《時事節目《時事節目《時事節目《時事節目《時事節目《時事節目《時事節目《時事節目《時事節目《時事節目《時事節目《時事節目《時事節目《時事節目《時事節目《時事節目《時事節目《時事節目《時事
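The repetition in the output above is what a greedy argmax tends to produce: the same context always yields the same next character, so once a context recurs the sequence cycles. One possible remedy, offered as an assumption rather than the notebook's chosen fix, is to sample the next character in proportion to its count. `sample_next` is a hypothetical helper and the counts are made up.

```python
import random


def sample_next(candidates, rng):
    """Draw a next character with probability proportional to its count."""
    chars = list(candidates)
    weights = [candidates[c] for c in chars]
    return rng.choices(chars, weights=weights, k=1)[0]


rng = random.Random(0)  # fixed seed so repeated runs draw the same sequence
toy_counts = {"《": 5, "，": 3, "。": 2}  # hypothetical counts for one context
samples = [sample_next(toy_counts, rng) for _ in range(1000)]
print(samples[:10])
```

Sampling trades some next-character accuracy for variety, so it suits generation (Test 1) better than the accuracy measurement in Test 2.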


#### Test 2

In [44]:
n_grams_dict = generate_backoff_n_grams_dict(jyutping_corpus)
count = 0
correct = 0
for sentence in jyutping_corpus:
    for i in range(len(sentence) - CHARACTER_COUNT - 1):
        if predict_next_char_backoff(n_grams_dict, sentence[:i+CHARACTER_COUNT]) == sentence[i+CHARACTER_COUNT]:
            correct += 1
        count += 1

print(f"Total of {count} predictions made")
print(f"{correct} predictions correct")
print(f"Prediction accuracy: {(correct/(count)) * 100}%")

Total of 16907 predictions made
13000 predictions correct
Prediction accuracy: 76.8912284852428%
