# LSTM Demo

This short demo uses TensorFlow/Keras to generate English words. It takes a tutorial by [Jason Brownlee](https://machinelearningmastery.com/develop-word-based-neural-language-models-python-keras/) as its starting point.

We start by importing `tensorflow` functions and classes.

In [1]:
from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.utils import to_categorical
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.layers import LSTM
from tensorflow.keras.layers import Embedding


We now import `numpy`, `random`, and `lingpy`; the latter is needed to process the data.

In [2]:
import numpy as np
from lingpy.sequence.sound_classes import get_all_ngrams
from lingpy import Wordlist
import random

We load the English data from the [WOLD](https://github.com/lexibank/wold) repository using `lingpy`.

In [3]:
wl = Wordlist.from_cldf("wold/cldf/cldf-metadata.json")
english = wl.get_list(col="English", entry="tokens", flat=True)
data = ["START "+" ".join(x)+" STOP" for x in english]

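We use the `Tokenizer` class to turn our segmented word forms into numerical data. Because `Tokenizer` lowercases its input by default, the boundary markers are stored as `start` and `stop`. We also attach a reverse mapping, `idx2tok`, from indices back to segments.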

In [12]:
tokenizer = Tokenizer()
tokenizer.fit_on_texts(data)
tokenizer.idx2tok = {idx: w for w, idx in tokenizer.word_index.items()}
encoded = tokenizer.texts_to_sequences(data)
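
To make the encoding step concrete, the following sketch mimics in plain Python roughly what the `word_index` mapping amounts to: each distinct segment receives a positive integer, more frequent segments receive lower numbers, and 0 stays free for padding. The helper names `build_index` and `encode` and the toy data are assumptions made for illustration; this is not the Keras implementation.

```python
from collections import Counter

def build_index(texts):
    """Map each lower-cased, space-separated token to an integer >= 1."""
    counts = Counter(tok for text in texts for tok in text.lower().split())
    # most_common() sorts by descending frequency; index 0 is left for padding
    return {tok: i for i, (tok, _) in enumerate(counts.most_common(), start=1)}

def encode(texts, index):
    """Replace every token by its integer index."""
    return [[index[tok] for tok in text.lower().split()] for text in texts]

# hypothetical toy data in the same "START ... STOP" layout as above
toy_data = ["START k æ t STOP", "START t æ p STOP"]
toy_index = build_index(toy_data)
toy_encoded = encode(toy_data, toy_index)
print(toy_index)
print(toy_encoded)
```

Inverting `toy_index` yields the toy counterpart of `idx2tok`, which is what allows predicted indices to be turned back into segments later on.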

We now encode the sequences for training the model. We extract all n-grams from each word and draw five random n-grams with at least two segments (we could train on all n-grams, but want to keep training time low for demonstration purposes). Every training row must have exactly 21 entries: 20 segments of context plus the segment to be predicted. We therefore add zeros to the left of sequences that are shorter than this and truncate longer sequences from the left.

In [5]:
sequences = []
for word in encoded:
    # keep only n-grams with at least two segments (context plus target)
    ngrams = [x for x in get_all_ngrams(word) if len(x) > 1]
    for i in range(5):
        ngram = random.choice(ngrams)
        # left-pad with zeros and keep the last 21 entries
        padded_ngram = 21 * [0] + list(ngram)
        sequences.append(padded_ngram[-21:])
print(len(sequences))

7580
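
The n-gram extraction and left-padding can be sketched without `lingpy`. The example below assumes that an n-gram is any contiguous subsequence of a word; `all_ngrams` and `left_pad` are hypothetical stand-ins written for this illustration, and the ordering of the n-grams returned by `get_all_ngrams` may differ.

```python
def all_ngrams(seq):
    """Return every contiguous subsequence of seq (hypothetical stand-in)."""
    return [seq[i:j] for i in range(len(seq)) for j in range(i + 1, len(seq) + 1)]

def left_pad(seq, length=21, pad=0):
    """Left-pad with `pad` up to `length`, keeping only the last `length` items."""
    return ([pad] * length + list(seq))[-length:]

toy_word = [1, 5, 2, 3, 4]  # an encoded toy word: start, three segments, stop
candidates = [ng for ng in all_ngrams(toy_word) if len(ng) > 1]
toy_row = left_pad(candidates[0], length=8)  # short length for readability
print(len(candidates))
print(toy_row)
```

Padding on the left keeps the most recent segments next to the target position, so the final entry of each row is always the segment to be predicted.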


We now split the sequences into our X and y vectors for training.

In [6]:
X, y = np.array([row[:20] for row in sequences]), np.array([row[-1] for row in sequences])
y = to_categorical(y, num_classes=len(tokenizer.idx2tok)+1)
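
As a small illustration of this split, the sketch below separates one padded row into its context and its target and then one-hot encodes the target. The helper `one_hot` is a plain-Python stand-in for what `to_categorical` does for a single index, and the toy row and class count are assumptions chosen for readability.

```python
def split_row(row):
    """Split a padded row into its input context and the next-segment target."""
    return row[:-1], row[-1]

def one_hot(index, num_classes):
    """Return a list of zeros with a single 1.0 at position `index`."""
    vec = [0.0] * num_classes
    vec[index] = 1.0
    return vec

toy_row = [0, 0, 0, 1, 5, 2]  # a short padded row (the real rows have 21 entries)
context, target = split_row(toy_row)
# 6 toy segments plus the padding index 0 give 7 classes
target_vec = one_hot(target, num_classes=7)
print(context, target)
print(target_vec)
```

The extra class for index 0 is the reason why the number of classes is `len(tokenizer.idx2tok)+1` rather than the plain vocabulary size.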

We now define and compile the model.

In [7]:
model = Sequential()
model.add(Embedding(len(tokenizer.idx2tok)+1, 10, input_length=20, mask_zero=True))
model.add(LSTM(50))
model.add(Dense(len(tokenizer.idx2tok)+1, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

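
To get a feeling for the size of this architecture, the following sketch counts its trainable weights by hand. It assumes the usual LSTM layout with four gates, each with input weights, recurrent weights, and a bias; the function name `count_parameters` and the vocabulary size of 40 are assumptions made for illustration, since the actual size depends on the data.

```python
def count_parameters(vocab_size, embed_dim=10, lstm_units=50):
    """Count trainable weights for Embedding -> LSTM -> Dense(softmax)."""
    embedding = vocab_size * embed_dim
    # four gates, each with input weights, recurrent weights, and a bias
    lstm = 4 * (lstm_units * (embed_dim + lstm_units) + lstm_units)
    # one weight per LSTM unit and output class, plus one bias per class
    dense = lstm_units * vocab_size + vocab_size
    return {"embedding": embedding, "lstm": lstm, "dense": dense,
            "total": embedding + lstm + dense}

toy_counts = count_parameters(vocab_size=40)  # 40 is an assumed vocabulary size
print(toy_counts)
```

Calling `model.summary()` on the compiled model should allow these hand-computed figures to be checked against the layer sizes Keras reports for the real vocabulary.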


Now we can train the model. 

In [41]:
model.fit(X, y, epochs=50, verbose=0)

<keras.callbacks.History at 0x7fd03c36d910>

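To sample random sequences (English words and pseudowords) from this model, we start from the `start` marker and repeatedly draw the next segment from the predicted probability distribution until `stop` is drawn or 20 segments have been generated.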

In [39]:
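for i in range(10):
    # start from a left-padded context that contains only the start marker
    current = np.array([19 * [0] + [tokenizer.word_index["start"]]])
    word = []
    for j in range(20):
        # cast to float64 and renormalize so that np.random.choice accepts the probabilities
        prediction = model.predict(current, verbose=0)[0].astype("float64")
        prediction /= prediction.sum()
        candidate = np.random.choice(len(prediction), p=prediction)
        # index 0 is the padding value and has no entry in idx2tok; treat it as the end
        segment = tokenizer.idx2tok.get(candidate)
        if segment is None or segment == "stop":
            break
        word += [segment]
        # slide the context window: append the new index, keep the last 20
        current = np.array([(list(current[0]) + [candidate])[-20:]])
    print("Word {0:2}: {1}".format(i+1, " ".join(word)))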

Word  1: k ɔː m ə ɹ
Word  2: θ ɑː p ə n t
Word  3: ɪə l ʌ n
Word  4: ð æ f
Word  5: b ɔː t
Word  6: aɪ t
Word  7: ɜː l ə n s
Word  8: ʊə w ɛ n t
Word  9: r æ
Word 10: tʃ ʌ s m
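
The two ingredients of the generation loop, drawing an index from a probability distribution and sliding the context window, can be sketched in plain Python. The helpers `sample_index` and `slide` and the toy probabilities are assumptions made for illustration; the loop above relies on `np.random.choice` instead.

```python
import random

def sample_index(probs, rng=random):
    """Draw an index with probability proportional to probs[index]."""
    threshold = rng.random() * sum(probs)
    cumulative = 0.0
    for index, p in enumerate(probs):
        cumulative += p
        if threshold < cumulative:
            return index
    return len(probs) - 1  # guard against floating-point round-off

def slide(window, new_item, size=20):
    """Append new_item and keep only the last `size` entries."""
    return (list(window) + [new_item])[-size:]

rng = random.Random(0)  # fixed seed so that the sketch is repeatable
toy_probs = [0.0, 0.7, 0.2, 0.1]  # hypothetical model output for four classes
draws = [sample_index(toy_probs, rng) for _ in range(1000)]
print({i: draws.count(i) for i in range(len(toy_probs))})
print(slide([0, 0, 1], 5, size=3))
```

Sampling rather than always taking the most probable index is what makes each generated word different; always picking the maximum would yield the same sequence on every run.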


That is all. This short demo shows that the model can produce both existing and potential English sequences.