# N-Gram Demo

Below, we define a simple bigram language model and train it on English phonetic transcriptions taken from [WOLD](https://github.com/lexibank/wold).

In [3]:
from collections import defaultdict
from random import choice
from lingpy import Wordlist

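# Load the WOLD data from its CLDF metadata file.
wl = Wordlist.from_cldf("wold/cldf/cldf-metadata.json")
# One plain list of sound segments per English word.
words = [list(tokens) for tokens in wl.get_list(col="English", flat=True, entry="tokens")]

# Map each state to every state observed to follow it. Duplicates are kept,
# so frequent transitions are drawn more often during the random walk.
model = defaultdict(list)
for w in words:
    # States of one word: "^", one bigram per adjacent pair of symbols, "$".
    bigrams = ["^"]+list(zip(["^"]+w, w+["$"]))+["$"]
    for i in range(len(bigrams)-1):
        model[bigrams[i]] += [bigrams[i+1]]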
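
To make the structure of the model more concrete, the following minimal sketch builds the same kind of transition table from two toy words instead of the WOLD data. The function name `build_model` and the toy words are assumptions made for illustration and are not part of the original notebook.

```python
from collections import defaultdict


def build_model(words):
    """Map each state to the list of states observed to follow it."""
    model = defaultdict(list)
    for w in words:
        # States: start marker, one bigram per adjacent symbol pair, end marker.
        states = ["^"] + list(zip(["^"] + w, w + ["$"])) + ["$"]
        for current, following in zip(states, states[1:]):
            model[current].append(following)
    return model


# Hypothetical toy data: two words that share their first two segments.
toy_words = [["k", "æ", "t"], ["k", "æ", "b"]]
toy_model = build_model(toy_words)

for state, successors in toy_model.items():
    print(state, "->", successors)
```

The state `('k', 'æ')` ends up with two different successors, `('æ', 't')` and `('æ', 'b')`, which is the only point at which a random walk over this toy model has a real choice.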

Next, we need a function that "walks" through the Markov chain to generate a random sequence. It begins at the start symbol `^` and follows randomly chosen transitions until it reaches the end symbol `$`.

In [4]:
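def walk(model):
    # Apart from "^" and "$", every state is a bigram (previous, current).
    current_char = "^"
    word = []
    while True:
        # Pick one of the observed successors of the current state at random.
        next_char = choice(model[current_char])
        if next_char != "$":
            # Emit the second symbol of the bigram.
            word += [next_char[1]]
            current_char = next_char
        else:
            break
    # The last bigram contributes the end symbol "$", which we drop.
    return word[:-1]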
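
Because the states of the chain are bigrams, each step of the walk depends on the two preceding symbols rather than only on the last one. The sketch below illustrates this on a hand-written toy model. The names `sample_word` and `toy_model` and the toy words are assumptions chosen for illustration, and the sampling logic is a restatement of `walk`, not a different algorithm.

```python
import random


def sample_word(model, rng=random):
    """Walk from "^" to "$" and collect the second symbol of each bigram."""
    state = "^"
    word = []
    while True:
        state = rng.choice(model[state])
        if state == "$":
            break
        word.append(state[1])
    # The last bigram contributes the end symbol "$", which is dropped here.
    return word[:-1]


# Hypothetical toy model for the two words "a b c" and "x b d".
toy_model = {
    "^": [("^", "a"), ("^", "x")],
    ("^", "a"): [("a", "b")],
    ("a", "b"): [("b", "c")],
    ("b", "c"): [("c", "$")],
    ("c", "$"): ["$"],
    ("^", "x"): [("x", "b")],
    ("x", "b"): [("b", "d")],
    ("b", "d"): [("d", "$")],
    ("d", "$"): ["$"],
}

# The walk can only reproduce the two training words: after ("a", "b")
# the only possible continuation is ("b", "c"), never ("b", "d").
samples = {" ".join(sample_word(toy_model)) for _ in range(50)}
print(samples)
```

A chain whose states were single symbols could also produce `a b d` from the same two words, since `b` alone would be followed by either `c` or `d`.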

We can now apply this function to generate sequences, some of which are existing English words and others plausible pseudowords.

In [6]:
for i in range(10):
    word = walk(model)
    print("Word {0:2}: {1}".format(i+1, " ".join(word)))

Word  1: s t ɪ ŋ
Word  2: r uː m
Word  3: d ɜː t
Word  4: ɜː n
Word  5: t r æ k
Word  6: f ə k uː n
Word  7: b æ k s l̩
Word  8: f ʌ r ɪ dʒ p əʊ l d
Word  9: t eɪ t
Word 10: v ɔɪ s
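
To tell existing words apart from pseudowords, one option is to check each generated sequence against the training data. The following is a minimal sketch under that assumption. The helper `split_attested` and the toy data are hypothetical and do not appear in the original notebook.

```python
def split_attested(generated, training_words):
    """Split generated sequences into attested words and novel pseudowords."""
    # Lists are not hashable, so the sequences are compared as tuples.
    attested = {tuple(w) for w in training_words}
    known = [w for w in generated if tuple(w) in attested]
    novel = [w for w in generated if tuple(w) not in attested]
    return known, novel


# Hypothetical toy data standing in for the training words and the samples.
training_words = [["k", "æ", "t"], ["b", "æ", "d"]]
generated = [["k", "æ", "t"], ["k", "æ", "b"]]

known, novel = split_attested(generated, training_words)
print("attested:", [" ".join(w) for w in known])
print("novel:   ", [" ".join(w) for w in novel])
```

In the notebook itself, the same helper could be called as `split_attested([walk(model) for _ in range(10)], words)`.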
