In [20]:
import time
print(f"Last updated on {time.asctime(time.gmtime())} UTC")

Last updated on Mon Dec 23 02:13:05 2019 UTC


### Purpose

This notebook shows the design process of Back-off n-grams Language Models (BNLM) enhanced with Neural Network Language Models (NNLM), as described in the paper "Neural Network Language Model for Chinese Pinyin Input Method Engine" by Chen et al. The task for this language model is, given a sequence of syllables, to predict which syllable is most likely to come next, a task also known as candidate sentence generation. The model is to be implemented in HKIME, an intelligent input method for Cantonese. The notebook has three sections, each building upon the previous one.

### Breakdown

- Section 1: Basic n-grams prediction model
- Section 2: Back-off n-grams language model with interpolated Kneser-Ney smoothing
- Section 3: BNLM (from section 2) with probabilities calculated with NNLM


In [6]:
import random

# Jyutping Corpus Processing

In [17]:
#TODO: Add corpus processing
#TODO: Find better corpora
#TODO: Look into web scraping to generate corpora ourselves
jyutping_corpus = []

# Section 1: Basic n-grams prediction model

Post-processing of the Cantonese corpus gives us a list of strings, where each string may be a phrase, a sentence, or a paragraph. For conciseness, we call all of these sentences. In this section, we split each sentence into n-grams and store the possible next characters for each n-gram in a Python dictionary. The naive prediction algorithm generates candidate sentences by randomly picking from the possible next characters for a given n-gram.

### Set n-grams character count (the n in n-grams)

In [18]:
CHARACTER_COUNT = 2

### Generating n-grams dictionary

This generates a dictionary in which each key is an n-gram and each value is a list of possible next characters.

In [19]:
#returns a dictionary mapping each n-gram to the list of characters that follow it
def generate_n_grams_dict(processed_corpus):
    result = dict()
    for sentence in processed_corpus:
        #i is the start index of the slice; the slice starting at
        #len(sentence) - CHARACTER_COUNT has no next char, so range() stops just before it
        for i in range(len(sentence) - CHARACTER_COUNT):
            grams = sentence[i:i+CHARACTER_COUNT]
            next_char = sentence[i+CHARACTER_COUNT]
            if grams in result:
                result[grams].append(next_char)
            else:
                #as long as there is an n-gram key in the dict, there is at least one next char
                result[grams] = [next_char]
    return result

### TODO: visualization of n-grams dict
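
One possible text-only view of the dictionary, sketched under assumptions: the helper `build_ngram_counts` and the toy corpus below are hypothetical and not part of this notebook's code. The sketch summarizes the followers of each n-gram as counts rather than as a raw list, which may be easier to read at a glance.

```python
from collections import Counter, defaultdict

def build_ngram_counts(corpus, n=2):
    """Map each n-gram to a Counter of the characters that follow it."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        # Every start index i for which a following character exists.
        for i in range(len(sentence) - n):
            counts[sentence[i:i + n]][sentence[i + n]] += 1
    return counts

# Hypothetical toy corpus, for illustration only (not the real Jyutping corpus).
toy_corpus = ["abcabd", "abce"]
counts = build_ngram_counts(toy_corpus)

# Print one line per n-gram with its follower counts.
for gram, followers in sorted(counts.items()):
    print(gram, dict(followers))
```

In this toy corpus, "ab" is followed by "c" twice and by "d" once, so a frequency view like this makes the more likely continuations easy to spot.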

### Prediction Model

This naive prediction model looks up the last n-gram of the sentence in the dictionary. If it is found, the model randomly selects a next character from the list of potential next characters; otherwise it returns None.

In [21]:
#returns a next character given an n_grams_dict and a sentence, or None if the last n-gram is unknown.
def predict_next_char(n_grams_dict, sentence):
    potentials = n_grams_dict.get(sentence[-CHARACTER_COUNT:], None)
    return random.choice(potentials) if potentials is not None else None
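
Because the dictionary stores one list entry per occurrence, duplicates included, `random.choice` samples each next character roughly in proportion to how often it followed the n-gram in the corpus. The sketch below illustrates this with a hypothetical follower list; the list contents, the seed, and the sample size are assumptions chosen for the example.

```python
import random
from collections import Counter

# Hypothetical follower list for a single n-gram: "c" was seen 3 times, "d" once.
followers = ["c", "c", "c", "d"]

rng = random.Random(0)  # fixed seed so the run is repeatable
num_samples = 10_000
samples = Counter(rng.choice(followers) for _ in range(num_samples))

# The share of "c" should land near 3/4, its relative frequency in the list.
share_c = samples["c"] / num_samples
print(f"share of 'c': {share_c:.2f}")
```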

### Testing

Here we test the implementation of the naive n-grams prediction model, for comparison with more sophisticated language models. There are two tests: the first extends a seed sentence by up to 200 characters, and the second checks the implementation quantitatively by counting how many next characters it predicts correctly on the test dataset.

TODO: Add a validation dataset. The current corpus is too small to be used both for training and validation.
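
A minimal sketch of how a held-out validation set might be carved out once the corpus is large enough. The function name `split_corpus`, the 20% validation fraction, and the fixed seed are assumptions for illustration, not choices made in this notebook.

```python
import random

def split_corpus(corpus, validation_fraction=0.2, seed=0):
    """Shuffle a copy of the corpus and return (training, validation) lists."""
    sentences = list(corpus)  # copy so the caller's list is not reordered
    random.Random(seed).shuffle(sentences)
    n_validation = int(len(sentences) * validation_fraction)
    return sentences[n_validation:], sentences[:n_validation]

# Hypothetical placeholder sentences standing in for the real corpus.
placeholder_corpus = [f"sentence{i}" for i in range(10)]
training, validation = split_corpus(placeholder_corpus)
print(len(training), len(validation))
```

The n-grams dictionary would then be built from `training` only, and accuracy measured on `validation`.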

#### Test 1

In [10]:
def testing(sentence):
    n_grams_dict = generate_n_grams_dict(jyutping_corpus)
    tmp = sentence
    #Extend the sentence by up to 200 characters; stop early if the last n-gram is not in the n-grams dict.
    for i in range(200):
        res = predict_next_char(n_grams_dict, tmp)
        if res is None:
            break
        else:
            tmp = tmp + res
    return tmp

#testing("(leihou chinese characters)")
#testing("(mgoi chinese characters)")

#### Test 2

In [16]:
n_grams_dict = generate_n_grams_dict(jyutping_corpus)
count = 0
correct = 0
for sentence in jyutping_corpus:
    #same index range as in generate_n_grams_dict: every position that has a next char
    for i in range(len(sentence) - CHARACTER_COUNT):
        if predict_next_char(n_grams_dict, sentence[:i+CHARACTER_COUNT]) == sentence[i+CHARACTER_COUNT]:
            correct += 1
        count += 1

#accuracy as a percentage; falls back to 0.0 while the corpus is empty to avoid dividing by zero
accuracy = 100 * correct / count if count else 0.0
print(f"Total of {count} predictions made")
print(f"{correct} predictions correct")
print(f"Prediction accuracy: {accuracy}%")

Total of 0 predictions made
0 predictions correct
Prediction accuracy: 0.0%


## Section 2: Back-off n-grams language model with interpolated Kneser-Ney smoothing