## **Word Language Model**

Welcome to the language model example on Skafos! The purpose of this notebook is to get you going end-to-end and show you how to create a custom model outside of our quickstart models. Below we will do the following:

1. Load Yelp review text data.
2. Build a word-level, neural network language model.
3. Convert the model to CoreML format and save it to the Skafos framework.

The code in this example was adapted from [**this article**](https://machinelearningmastery.com/how-to-develop-a-word-level-neural-language-model-in-keras/) and follows along with [**this blog post**]() (*coming soon*) that we wrote to help guide you through it.

---

Execute each cell one-by-one by selecting the cell and doing one of the following:

-  Clicking the "play" button at the top of this frame.
-  Typing 'Control + Enter' or 'Shift + Enter'.

#### **Prior to running any code below**
Make sure you have installed all Python dependencies in the JLab session before continuing. Open up the terminal and type:
```bash
$ pip install -r requirements.txt
```
Once you've done that, restart the kernel for this notebook by hitting the clockwise arrow at the top of this panel.
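
If you want to confirm the install worked before running the import cell, a minimal sketch like the following may help. The module names are an assumption taken from the imports used in this notebook, not read from `requirements.txt`, and `find_missing` is a hypothetical helper written for this check.

```python
import importlib.util

# Assumed module names, copied from the import cell of this notebook.
# Adjust the list if your requirements.txt differs.
REQUIRED_MODULES = ["numpy", "coremltools", "turicreate", "keras", "skafossdk"]

def find_missing(modules):
    """Return the names in `modules` that cannot be located in this environment."""
    missing = []
    for name in modules:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # Treat any lookup failure as "not installed"
            found = False
        if not found:
            missing.append(name)
    return missing

missing = find_missing(REQUIRED_MODULES)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All required modules were found.")
```

If anything is reported as missing, re-run the `pip install` command above and restart the kernel again.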

In [None]:
# Import necessary libraries - if imports fail, make sure you have installed all dependencies in the requirements.txt
import json
import string
from pickle import dump

from numpy import array
import coremltools
import turicreate as tc
from keras.models import Sequential
from keras.utils import to_categorical
from keras.layers import Dense, LSTM, Embedding
from keras.preprocessing.text import Tokenizer
from keras.preprocessing.sequence import pad_sequences
from skafossdk import *

In [None]:
# Define a few helper functions

# End of sentence tag
eos = "<eos>"

# Convert text entries into a big text blob
def parse_text(data):
    full_text = ""
    for text in data:
        entry = text.replace("\n", "").replace("\'", "").replace(".", f" {eos} ")
        full_text += entry
    return full_text
    
# Turn a text blob into clean tokens
def clean_text(doc):
    # split into tokens by white space
    tokens = doc.split()
    # remove punctuation from each token
    table = str.maketrans("", "", string.punctuation)
    tokens = [w.translate(table) if eos not in w else w for w in tokens]
    # remove remaining tokens that are not alphabetic or not end of sentence tag
    tokens = [word for word in tokens if word.isalpha() or eos in word]
    # make lower case
    tokens = [word.lower() for word in tokens]
    return tokens

### 1. **Load the data**
The training data for this example is Yelp review data. First we load the data with Turi Create.
Then we parse and clean the text, creating sequences of up to 11 words (sequences that cross a sentence boundary are cut short and padded later). The first 10 words in the sequence will be fed to
the neural network as input, and the 11th word will be used as output. We also perform tokenization, which maps each word to a unique integer value.
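
To make the windowing and integer encoding concrete, here is a small pure-Python sketch on a toy sentence. It uses a window of 3 + 1 words instead of 10 + 1, and the helpers `make_windows` and `build_word_index` are illustrative names, not part of Keras. One simplification to be aware of: the Keras `Tokenizer` used later assigns indices by word frequency, whereas this sketch numbers words in order of first appearance.

```python
# Toy token list standing in for the cleaned Yelp tokens (illustrative only)
toy_tokens = "the food was great and the service was fast".split()

def make_windows(tokens, window):
    """Return every run of `window` consecutive tokens."""
    return [tokens[i:i + window] for i in range(len(tokens) - window + 1)]

def build_word_index(tokens):
    """Map each distinct word to an integer, starting at 1 (0 is left for padding)."""
    word_index = {}
    for word in tokens:
        if word not in word_index:
            word_index[word] = len(word_index) + 1
    return word_index

window = 3 + 1  # 3 input words + 1 output word (the notebook uses 10 + 1)
word_index = build_word_index(toy_tokens)

for seq in make_windows(toy_tokens, window):
    encoded = [word_index[w] for w in seq]
    # All but the last integer are the input; the last one is the target
    inputs, target = encoded[:-1], encoded[-1]
    print(seq[:-1], "->", seq[-1], "|", inputs, "->", target)
```

Each printed row pairs the input words with the word to predict, first as text and then as the integers the network actually sees.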

In [None]:
# Load a small sample of user reviews from a yelp dataset
data = tc.SFrame('https://static.turi.com/datasets/regression/yelp-data.csv')['text'].sample(.01) # grab only 1% for this example
print(f'\n\nLoaded {len(data)} text entries from the Yelp review dataset', flush=True)

In [None]:
# Do some initial cleaning and then dump all of the text together into a single document
full_text = parse_text(data)
del data  # free some memory

In [None]:
# Clean the text and perform tokenization
tokens = clean_text(doc=full_text)
print('Total Tokens: %d' % len(tokens), flush=True)
print('Unique Tokens: %d' % len(set(tokens)), flush=True)
print('\nSample Tokens\n', tokens[:50], flush=True)

In [None]:
# Organize into sequences of tokens
length = 10 + 1 # 10 words as input, 1 as output
sequences = list()
for i in range(length, len(tokens)):
    # select sequence of tokens
    seq = tokens[i-length:i]
    # convert into a line
    line = ' '.join(seq)
    # check for eos tag in the line
    if eos in line:
        # if eos tag is the last term in the line - remove it from the end
        if line.endswith(eos):
            line = line[:-len(eos)].strip()
        # same thing if it's the first
        elif line.startswith(eos):
            line = line[len(eos):].strip()
        else:
            try:
                # keep the longer side of the sentence boundary
                front, back = line.split(eos)
                if len(front) > len(back):
                    line = front.strip()
                else:
                    line = back.strip()
            except ValueError:
                # more than one eos tag in the line: skip it - we have plenty of data
                continue
    # store line with others
    sequences.append(line)
print(f'Total Sequences: {len(sequences)}', flush=True)

In [None]:
# Encode sequences of words as integers
tokenizer = Tokenizer()
tokenizer.fit_on_texts(sequences)
tokenized_sequences = tokenizer.texts_to_sequences(sequences)
max_sequence_len = max([len(x) for x in tokenized_sequences])
input_sequences = array(pad_sequences(
    tokenized_sequences,
    maxlen=max_sequence_len,
    padding='pre'
))

In [None]:
# Get vocabulary size
vocab_size = len(tokenizer.word_index) + 1
print(f'{vocab_size} total unique words in our training data corpus', flush=True)

In [None]:
# Let's take a look at our tokenized sequences (notice the integer values instead of raw text)
input_sequences[:4]

In [None]:
# Separate sequences into input and output (X and y)
X, y = input_sequences[:,:-1], input_sequences[:,-1]
y = to_categorical(y, num_classes=vocab_size)
seq_length = X.shape[1]

### 2. **Train the model**

In [None]:
# Create the model
model = Sequential()
model.add(Embedding(input_dim=vocab_size, output_dim=32, input_length=seq_length))   # Docs: https://www.tensorflow.org/api_docs/python/tf/keras/layers/Embedding
model.add(LSTM(units=128))                                                           # Docs: https://www.tensorflow.org/api_docs/python/tf/keras/layers/LSTM
model.add(Dense(128, activation='relu'))                                             # Docs: https://www.tensorflow.org/api_docs/python/tf/keras/layers/Dense
model.add(Dense(vocab_size, activation='softmax'))
model.summary()  # prints the summary itself and returns None, so no print() wrapper is needed

# Compile the model
model.compile(loss='categorical_crossentropy', optimizer='adam', metrics=['accuracy'])

In [None]:
# Train model - this might take a while. For better results, train for additional epochs
model.fit(X, y, batch_size=256, epochs=5)

In [None]:
# Pick up training from where you left off with the following
# With initial_epoch=5 and epochs=10, training resumes at epoch 6 and stops after epoch 10
#model.fit(X, y, batch_size=256, initial_epoch=5, epochs=10)

In [None]:
# Invert the tokenizer map so we can look up a word by its index
index_word_lookup = {index: word for word, index in tokenizer.word_index.items()}

# Function to generate new text based on the input
def generate_text(seed_text, next_words, max_sequence_len, model):
    for j in range(next_words):
        token_list = pad_sequences(
            sequences=tokenizer.texts_to_sequences([seed_text]),
            maxlen=max_sequence_len-1,
            padding='pre'
        )
        predicted = model.predict_classes(token_list, verbose=0)
        # Generate the output word
        seed_text += " " + index_word_lookup[predicted[0]]
    return seed_text

In [None]:
# Test out the language model by passing in some seed text and the number of words
generate_text("I think that I", 5, length, model)

### 3. **Save the model**
Once your model has been created, it must be exported to Core ML format so it can be used by your app.

In [None]:
# Export the language model to Core ML format
coreml_model_name = "WordLanguageModel.mlmodel"
coreml_model = coremltools.converters.keras.convert(
    model,
    input_names=['tokenizedInputSeq'],
    output_names=['tokenProbs']
)
# Add description information (if you want) and export
coreml_model.short_description = 'Predicts the most likely next word given a string of text'
coreml_model.input_description['tokenizedInputSeq'] = 'An array of 10 tokens according to a pre-defined mapping'
coreml_model.output_description['tokenProbs'] = 'An array of token probabilities across the entire vocabulary'
coreml_model.save(coreml_model_name)

### Putting the model in your app
If you haven't configured your Skafos project to handle Core ML delivery to your app, make sure to do that by entering the proper IDs and keys on your project page of the dashboard. Follow along with the integration guide from there (you will see the link).

Instead of downloading one of the pre-trained models in the integration guide, go ahead and download the `.mlmodel` that you just trained and converted to Core ML. Drag it to your app's Xcode project. You will also need to include the `tokenizer.word_index` and `index_word_lookup` dictionaries somewhere in your app so it can translate text to integers and back.
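
The translation the two dictionaries make possible can be sketched in plain Python. This is an illustration under stated assumptions, not the app-side implementation: the tiny vocabulary is made up, `encode` and `decode` are hypothetical helpers, and the text handling is simplified to lowercasing and whitespace splitting. It also shows one detail worth knowing when the dictionaries travel as JSON: JSON object keys are always strings, so the integer keys of `index_word_lookup` have to be converted back after loading.

```python
import json

# Hypothetical miniature vocabulary; the real one comes from `tokenizer.word_index`
word_index = {"i": 1, "think": 2, "that": 3, "the": 4, "food": 5}
index_word_lookup = {index: word for word, index in word_index.items()}

def encode(text, word_index, seq_length=10):
    """Turn text into exactly `seq_length` integers, left-padded with zeros."""
    # Words missing from the vocabulary are dropped
    ids = [word_index[w] for w in text.lower().split() if w in word_index]
    ids = ids[-seq_length:]  # keep only the most recent words
    return [0] * (seq_length - len(ids)) + ids

def decode(token_probs, index_word_lookup):
    """Return the word whose index has the highest probability."""
    best = max(range(len(token_probs)), key=token_probs.__getitem__)
    return index_word_lookup.get(best)

# Simulate saving and loading the lookup as JSON, then restore the integer keys
restored = {int(k): v for k, v in json.loads(json.dumps(index_word_lookup)).items()}

print(encode("I think that the", word_index))
print(decode([0.0, 0.1, 0.1, 0.1, 0.2, 0.5], restored))
```

The list passed to `decode` stands in for the model's `tokenProbs` output, where position `i` holds the probability of the word with index `i`.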

Moving forward, as you retrain and update your model, you **won't** need to repeat the download-and-drag step. You can just save it using the Skafos SDK below:

In [None]:
# Save (push to device) model through the Skafos SDK
## This will trigger an update to your app if you have configured your app with the Skafos framework and have downloaded the initial .mlmodel to Xcode

# NOTE: `ska` is never defined above; this assumes the SDK entry point is Skafos()
ska = Skafos()

with open(coreml_model_name, 'rb') as model_data:
    # load the coreml model from disk
    model_obj = model_data.read()
    # save through the skafos sdk
    res = ska.engine.save_model(
        coreml_model_name,
        model_obj,
        tags=['latest'],
        access='public'
    )
    # print the result
    print(res.result(), flush=True)

In [None]:
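# Save the lookup dictionaries so your app can translate between words and integers
# (JSON stores the integer keys of index_word_lookup as strings)
with open('index_word_lookup.json', 'w') as fp: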
    json.dump(index_word_lookup, fp)
with open('word_index_lookup.json', 'w') as fp:
    json.dump(tokenizer.word_index, fp)

If you made it here, great work! Another blog post in the future will show how to build an iOS application to use this model!