# Fast.AI - Swiftkey : Predict the next word

In [1]:
%reload_ext autoreload
%autoreload 2
%matplotlib inline

from fastai.learner import *

import torchtext
from torchtext import vocab, data
from torchtext.datasets import language_modeling

from fastai.rnn_reg import *
from fastai.rnn_train import *
from fastai.nlp import *
from fastai.lm_rnn import *

import dill as pickle

## Language modeling

### Data

We will simply try to create a *language model*; that is, a model that can predict the next word in a sentence. Why? Because our model first needs to understand the structure of English.

The corpora were collected from publicly available sources by a web crawler. The crawler checks the language of each text, so that the corpus consists mainly of text in the desired language.

Each entry is tagged with its date of publication. Where user comments are included, they are tagged with the date of the main entry.

Each entry is tagged with the type of entry, based on the type of website it was collected from (e.g. newspaper or personal blog). If possible, each entry is also tagged with one or more subjects based on the title or keywords of the entry (e.g. an entry from the sports section of a newspaper is tagged with the "sports" subject). In many cases it is not feasible to tag the entries (for example, it is not really practical to tag each individual Twitter entry, though I've got some ideas which might be implemented in the future), or the automated process finds no subject; in either case the entry is tagged with a '0'.

To save space, the subject and type are given as numerical codes.

Once the raw corpus has been collected, it is parsed further to remove duplicate entries and split into individual lines. Approximately 50% of each entry is then deleted. Since the entries cannot be fully recreated, are anonymised, and this is a non-profit venture, I believe that it would fall under Fair Use.

Unfortunately, there are no good pretrained language models available to download, so we need to create our own. To follow along with this notebook, we suggest downloading the dataset from [this location](https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip).

In [2]:
PATH='data/Swiftkey/'

TRN_PATH = 'train/'
VAL_PATH = 'test/'
TRN = f'{PATH}{TRN_PATH}'
VAL = f'{PATH}{VAL_PATH}'

%ls {PATH}

[0m[01;34mmodels[0m/  [01;34mtest[0m/  [01;34mtmp[0m/  [01;34mtrain[0m/


Let's look inside the training folder...

In [3]:
trn_files = !ls {TRN}
trn_files[:3]

['trainBlogs.txt', 'trainNews.txt', 'trainTwitter.txt']

...and at an example line from the first file.

In [4]:
review = !cat {TRN}{trn_files[0]}
review[1]

'See the blessed of love'


Now we'll check how many words are in the dataset.

In [5]:
!find {TRN} -name '*.txt' | xargs cat | wc -w

7164194


In [6]:
!find {VAL} -name '*.txt' | xargs cat | wc -w

3055888


Before we can analyze text, we must first *tokenize* it. This refers to the process of splitting a sentence into an array of words (or more generally, into an array of *tokens*).

In [7]:
' '.join(spacy_tok(review[1]))

'See the blessed of love'
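
To make the idea of tokenization concrete, here is a minimal sketch that uses only Python's `re` module. The `simple_tok` helper is hypothetical and much cruder than `spacy_tok`; it only illustrates how punctuation ends up as separate tokens.

```python
import re

def simple_tok(text):
    """Split text into word and punctuation tokens (hypothetical helper, not spacy_tok)."""
    # \w+ matches a run of letters/digits; [^\w\s] matches one punctuation character
    return re.findall(r"\w+|[^\w\s]", text)

tokens = simple_tok("See the blessed of love, again!")
print(' '.join(tokens))
```

A real tokenizer such as spacy also handles contractions, URLs, and similar cases, which this regular expression does not attempt.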

We use PyTorch's [torchtext](https://github.com/pytorch/text) library to preprocess our data, telling it to use the wonderful [spacy](https://spacy.io/) library to handle tokenization.

First, we create a torchtext *field*, which describes how to preprocess a piece of text - in this case, we tell torchtext to make everything lowercase, and tokenize it with spacy.

In [8]:
TEXT = data.Field(lower=True, tokenize=spacy_tok)

fastai works closely with torchtext. We create a ModelData object for language modeling by taking advantage of `LanguageModelData`, passing it our torchtext field object, and the paths to our training, test, and validation sets. In this case, we don't have a separate test set, so we'll just use `VAL_PATH` for that too.

As well as the usual `bs` (batch size) parameter, we also now have `bptt`; this defines how many words are processed at a time in each row of the mini-batch. More importantly, it defines how many 'layers' we will backprop through. Making this number higher will increase time and memory requirements, but will improve the model's ability to handle long sentences.

In [9]:
bs=64; bptt=70

In [10]:
FILES = dict(train=TRN_PATH, validation=VAL_PATH, test=VAL_PATH)
md = LanguageModelData.from_text_files(PATH, TEXT, **FILES, bs=bs, bptt=bptt, min_freq=10)

Building our `ModelData` object automatically fills the `TEXT` object with a very important attribute: `TEXT.vocab`. This is a *vocabulary*, which stores which words (or *tokens*) have been seen in the text, and how each word will be mapped to a unique integer id. We'll need to use this information again later, so we save it.

*(Technical note: Python's standard `pickle` library can't handle this correctly, so at the top of this notebook we imported the `dill` library instead, under the name `pickle`)*.

In [11]:
pickle.dump(TEXT, open(f'{PATH}models/TEXT.pkl','wb'))

Here are the: # batches; # unique tokens in the vocab; # items in the training dataset (just one, as explained below); # tokens in the training set

In [12]:
len(md.trn_dl), md.nt, len(md.trn_ds), len(md.trn_ds[0].text)

(1904, 26094, 1, 8538068)

This is the start of the mapping from integer IDs to unique tokens.

In [13]:
# 'itos': 'int-to-string'
TEXT.vocab.itos[:12]

['<unk>', '<pad>', '.', 'the', ',', 'to', 'and', 'a', 'i', 'of', 'in', '!']

In [14]:
# 'stoi': 'string to int'
TEXT.vocab.stoi['the']

3
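
The `itos`/`stoi` mapping and the effect of `min_freq` can be sketched in plain Python. This is an illustration under simplifying assumptions, not torchtext's implementation: `build_vocab` is a hypothetical helper, and the alphabetical tie-breaking is a choice made here only to keep the order deterministic.

```python
from collections import Counter

def build_vocab(tokens, min_freq=1, specials=('<unk>', '<pad>')):
    """Return (itos, stoi) for tokens that appear at least min_freq times."""
    counts = Counter(tokens)
    # Most frequent first; ties broken alphabetically so the order is deterministic
    kept = sorted((w for w, c in counts.items() if c >= min_freq),
                  key=lambda w: (-counts[w], w))
    itos = list(specials) + kept
    stoi = {w: i for i, w in enumerate(itos)}
    return itos, stoi

toy_tokens = "the cat sat on the mat . the dog sat .".split()
itos, stoi = build_vocab(toy_tokens, min_freq=2)

# Words rarer than min_freq are not in the vocab and fall back to the '<unk>' id
ids = [stoi.get(w, stoi['<unk>']) for w in "the cat sat .".split()]
print(itos, ids)
```

Raising `min_freq` shrinks the vocabulary and maps more rare words to `<unk>`, which is why `<unk>` can show up among the model's predictions later on.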

Note that in a `LanguageModelData` object there is only one item in each dataset: all the words of the text joined together.

In [15]:
md.trn_ds[0].text[:12]

['x',
 'see',
 'the',
 'blessed',
 'of',
 'love',
 'with',
 'the',
 'stove',
 'burner',
 'off',
 ',']

torchtext will handle turning these words into integer IDs for us automatically.

In [16]:
TEXT.numericalize([md.trn_ds[0].text[:12]])

Variable containing:
  1248
   100
     3
  2281
     9
    92
    21
     3
  8453
 13309
   138
     4
[torch.cuda.LongTensor of size 12x1 (GPU 0)]

Our `LanguageModelData` object will create batches with 64 columns (that's our batch size), and varying sequence lengths of around 70 tokens (that's our `bptt` parameter - *backprop through time*).

Each batch also contains the exact same data as labels, but one word later in the text - since we're trying to always predict the next word. The labels are flattened into a 1d array.

In [17]:
next(iter(md.trn_dl))

(Variable containing:
   1248     11    600  ...      18     10    470
    100     30      3  ...      71   1035    471
      3    114     78  ...      10   1887      2
         ...            ⋱           ...         
    241    163      0  ...     299     21     11
    112    371     18  ...      51     40    676
   2803     31     46  ...       3     25      5
 [torch.cuda.LongTensor of size 68x64 (GPU 0)], Variable containing:
    100
     30
      3
   ⋮   
   9791
    173
    425
 [torch.cuda.LongTensor of size 4352 (GPU 0)])
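
The batch layout described above can be sketched in plain Python. This is a minimal illustration, not fastai's implementation: `batchify` and `get_batch` are hypothetical helpers, the chunk length is fixed instead of varying around `bptt`, and nested lists stand in for tensors.

```python
def batchify(ids, bs):
    """Arrange a flat list of token ids into bs columns (each row is one time step)."""
    n_rows = len(ids) // bs  # leftover tokens that don't fill a column are dropped
    cols = [ids[c * n_rows:(c + 1) * n_rows] for c in range(bs)]
    return [[cols[c][r] for c in range(bs)] for r in range(n_rows)]

def get_batch(rows, start, bptt):
    """Return (inputs, flattened targets) for a chunk of up to bptt time steps."""
    seq_len = min(bptt, len(rows) - 1 - start)
    x = rows[start:start + seq_len]
    # Targets are the same positions one time step later, flattened row by row
    y = [tok for row in rows[start + 1:start + 1 + seq_len] for tok in row]
    return x, y

rows = batchify(list(range(20)), bs=4)   # 5 time steps x 4 columns
x, y = get_batch(rows, start=0, bptt=3)
print(x)
print(y)
```

In the real batch shown above, 68 time steps by 64 columns flatten in the same way to the 68 × 64 = 4352 labels.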

### Train

We have a number of parameters to set - we'll learn more about these later, but you should find these values suitable for many problems.

In [18]:
em_sz = 200  # size of each embedding vector
nh = 500     # number of hidden activations per layer
nl = 3       # number of layers

Researchers have found that large amounts of *momentum* (which we'll learn about later) don't work well with these kinds of *RNN* models, so we create a version of the *Adam* optimizer with less momentum than its default of `0.9`.

In [19]:
opt_fn = partial(optim.Adam, betas=(0.7, 0.99))
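
To get a feel for what lowering the first `betas` value does, here is a small sketch of the exponential moving average that Adam keeps of the gradient. It is a simplification: the `ema` helper is hypothetical, and Adam's bias correction and second-moment scaling are left out.

```python
def ema(values, beta):
    """Exponential moving average, as in Adam's first-moment update (no bias correction)."""
    m, out = 0.0, []
    for g in values:
        m = beta * m + (1 - beta) * g
        out.append(m)
    return out

# A toy gradient that flips sign halfway through
grads = [1.0] * 10 + [-1.0] * 10
slow = ema(grads, beta=0.9)   # Adam's default
fast = ema(grads, beta=0.7)   # the value used in this notebook

# Number of steps after the flip until the running average changes sign
steps_slow = next(i for i, m in enumerate(slow[10:], 1) if m < 0)
steps_fast = next(i for i, m in enumerate(fast[10:], 1) if m < 0)
print(steps_slow, steps_fast)
```

With the smaller value the average forgets old gradients sooner, so the optimizer reacts faster when the gradient direction changes.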

fastai uses a variant of the state-of-the-art [AWD LSTM Language Model](https://arxiv.org/abs/1708.02182) developed by Stephen Merity. A key feature of this model is that it provides excellent regularization through [Dropout](https://en.wikipedia.org/wiki/Convolutional_neural_network#Dropout). There is no simple way known (yet!) to find the best values of the dropout parameters below - you just have to experiment...

However, the other parameters (`alpha`, `beta`, and `clip`) shouldn't generally need tuning.

In [20]:
learner = md.get_model(opt_fn, em_sz, nh, nl,
               dropouti=0.05, dropout=0.05, wdrop=0.1, dropoute=0.02, dropouth=0.05)
learner.reg_fn = partial(seq2seq_reg, alpha=2, beta=1)
learner.clip=0.3
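
Assuming `clip` refers to gradient-norm clipping (an assumption; this notebook does not show how fastai applies it), the idea can be sketched in plain Python. The `clip_by_norm` helper is hypothetical and works on a flat list of numbers instead of model parameters.

```python
import math

def clip_by_norm(grads, max_norm):
    """Rescale grads so their L2 norm is at most max_norm (hypothetical helper)."""
    total = math.sqrt(sum(g * g for g in grads))
    if total <= max_norm:
        return list(grads)          # already small enough, leave unchanged
    scale = max_norm / total
    return [g * scale for g in grads]

clipped = clip_by_norm([0.3, -0.4], max_norm=0.3)   # original norm is 0.5
print(clipped)
```

Clipping keeps the direction of the update but caps its size, which helps against the occasional very large gradients that RNNs can produce.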

As you can see below, I gradually tuned the language model in a few stages. I possibly could have trained it further (it wasn't yet overfitting), but I didn't have time to experiment more. Maybe you can see if you can train it to a better accuracy! (I used `lr_find` to find a good learning rate, but didn't save the output in this notebook. Feel free to try running it yourself now.)

In [21]:
learner.fit(3e-3, 4, wds=1e-6, cycle_len=1, cycle_mult=2)


[0.      5.48938 5.42713]                                     
[1.      5.19743 5.10384]                                     
[2.      5.05661 5.01879]                                     
[3.      5.07179 4.98611]                                     
[4.      4.97089 4.90923]                                     
[5.      4.86919 4.85823]                                     
[6.      4.83635 4.84596]                                     
[7.      4.96943 4.90295]                                     
[8.      4.91927 4.8689 ]                                     
[9.      4.88085 4.84037]                                     
[10.       4.81994  4.81428]                                  
[11.       4.77564  4.79221]                                  
[12.       4.74137  4.77726]                                  
[13.       4.71396  4.76933]                                  
[14.       4.68759  4.76787]                                  
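
The 15 epochs in the log above follow from the cycle settings, assuming each cycle is `cycle_mult` times as long as the one before it. A small sketch of that arithmetic (the helper names are hypothetical, not part of fastai):

```python
def cycle_lengths(n_cycle, cycle_len, cycle_mult=1):
    """Length in epochs of each cycle; each is cycle_mult times the previous one."""
    return [cycle_len * cycle_mult ** i for i in range(n_cycle)]

def cycle_starts(lengths):
    """Epoch index at which each cycle (and its learning-rate restart) begins."""
    starts, epoch = [], 0
    for n in lengths:
        starts.append(epoch)
        epoch += n
    return starts

lengths = cycle_lengths(4, cycle_len=1, cycle_mult=2)   # 1 + 2 + 4 + 8 epochs
print(sum(lengths), cycle_starts(lengths))
```

Under this assumption the later restarts fall at epochs 3 and 7, which is consistent with the training loss ticking up at exactly those epochs in the log.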



In [22]:
learner.save_encoder('adam1_enc')

In [21]:
learner.load_encoder('adam1_enc')

In [25]:
learner.save_cycle('adam3_10',2)

In [26]:
learner.load_cycle('adam3_10',2)


In [27]:
learner.fit(3e-3, 1, wds=1e-6, cycle_len=10)


[0.      4.87002 4.84449]                                     
[1.      4.86875 4.83268]                                     
[2.      4.82881 4.8159 ]                                     
[3.      4.79297 4.80067]                                     
[4.      4.78179 4.78485]                                     
[5.      4.72683 4.7676 ]                                     
[6.      4.68094 4.75617]                                     
[7.      4.65825 4.74796]                                     
[8.      4.65066 4.74397]                                     
[9.      4.63694 4.74501]                                     



In [30]:
learner.save_encoder('adam3_10_enc')

In the sentiment analysis section, we'll just need half of the language model - the *encoder*, so we save that part.

In [31]:
learner.save_encoder('adam3_20_enc')

In [32]:
learner.load_encoder('adam3_20_enc')

Language modeling accuracy is generally measured using the metric *perplexity*, which is simply `exp()` of the loss function we used.

In [33]:
#math.exp(4.165)
math.exp(4.745)

115.0078055031094
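
One way to read this number: a model that guessed uniformly among `V` words would have a loss of `ln(V)` and therefore a perplexity of `V`. The sketch below checks that relationship; the `perplexity` helper is hypothetical, and it assumes the loss is a mean per-token cross-entropy in natural-log units.

```python
import math

def perplexity(loss):
    """Perplexity is exp() of the mean per-token cross-entropy (natural log)."""
    return math.exp(loss)

vocab_size = 26094                        # md.nt from earlier in this notebook
uniform_loss = math.log(vocab_size)       # loss of a uniform guess over the vocab
print(perplexity(uniform_loss))           # approximately vocab_size
print(perplexity(4.745))                  # the validation loss used above
```

So a perplexity of about 115 means the model is, on average, about as uncertain as if it were choosing uniformly among roughly 115 words, compared with 26094 for a model that knows nothing.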

In [34]:
pickle.dump(TEXT, open(f'{PATH}models/TEXT.pkl','wb'))

### Test

We can play around with our language model a bit to check it seems to be working OK. First, let's create a short bit of text to 'prime' a set of predictions. We'll use our torchtext field to numericalize it so we can feed it to our language model.

In [136]:
m=learner.model
#ss=""". So, it wasn't quite was I was expecting, but I really liked it anyway! The best"""
##------------------------------ Quiz 2 -------------------------------------
#ss="""The guy in front of me just bought a pound of bacon, a bouquet, and a case of"""
#ss="""You're the reason why I smile everyday. Can you follow me please? It would mean the"""
#ss="""Hey sunshine, can you follow me and make me the"""
#ss="""Very early observations on the Bills game: Offense still struggling but the"""
#ss="""Go on a romantic date at the"""
#ss="""Well I'm pretty sure my granny has some old bagpipes in her garage I'll dust them off and be on my"""
#ss="""Ohhhhh #PointBreak is on tomorrow. Love that film and haven't seen it in quite some"""
#ss="""After the ice bucket challenge Louis will push his long wet hair out of his eyes with his little"""
#ss="""Be grateful for the good times and keep the faith during the"""
#ss="""If this isn't the cutest thing you've ever seen, then you must be"""
##------------------------------ Quiz 3 -------------------------------------
#ss="""When you breathe, I want to be the air for you. I'll be there for you, I'd live and I'd"""
#ss="""Guy at my table's wife got up to go to the bathroom and I asked about dessert and he started telling me about his"""
#ss="""I'd give anything to see arctic monkeys this"""
#ss="""Talking to your mom has the same effect as a hug and helps reduce your"""
#ss="""When you were in Holland you were like 1 inch away from me but you hadn't time to take a"""
#ss="""I'd just like all of these questions answered, a presentation of evidence, and a jury to settle the"""
#ss="""I can't deal with unsymetrical things. I can't even hold an uneven number of bags of groceries in each"""
#ss="""Every inch of you is perfect from the bottom to the"""
#ss="""I’m thankful my childhood was filled with imagination and bruises from playing"""
ss="""I like how the same people are in almost all of Adam Sandler's"""

s = [spacy_tok(ss)]
t=TEXT.numericalize(s)
' '.join(s[0])

"I like how the same people are in almost all of Adam Sandler 's"

We haven't yet added methods to make it easy to test a language model, so we'll need to manually go through the steps.

In [137]:
# Set batch size to 1
m[0].bs=1
# Turn off dropout
m.eval()
# Reset hidden state
m.reset()
# Get predictions from model
res,*_ = m(t)
# Put the batch size back to what it was
m[0].bs=bs

Let's see what the top 30 predictions were for the next word after our short text:

In [138]:
nexts = torch.topk(res[-1], 30)[1]
[TEXT.vocab.itos[o] for o in to_np(nexts)]

['<unk>',
 '.',
 'new',
 '"',
 'life',
 'books',
 'own',
 ',',
 'work',
 'and',
 "'",
 'old',
 'first',
 'the',
 'time',
 'eyes',
 'business',
 'history',
 'music',
 '“',
 'self',
 'great',
 'people',
 'problems',
 '(',
 'real',
 'stuff',
 'best',
 'other',
 'past']
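
What `torch.topk` does here can be sketched without PyTorch. The example below assumes the model's outputs are unnormalized scores, and uses a made-up five-word vocabulary and made-up scores; `softmax` and `topk_indices` are hypothetical helpers written for this illustration.

```python
import math

def softmax(scores):
    """Turn raw scores into probabilities that sum to 1."""
    m = max(scores)                              # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def topk_indices(scores, k):
    """Indices of the k largest scores, best first."""
    return sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]

toy_itos = ['<unk>', 'movies', 'books', 'life', '.']   # made-up vocabulary
toy_scores = [2.0, 1.5, 0.3, -0.2, 1.0]                # made-up model outputs
probs = softmax(toy_scores)
top3 = [(toy_itos[i], round(probs[i], 3)) for i in topk_indices(toy_scores, 3)]
print(top3)
```

Because softmax preserves the ordering of the scores, taking the top entries of the raw outputs gives the same ranking as taking the most probable words.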

...and let's see if our model can generate a bit more text all by itself!

In [64]:
print(ss,"\n")
for i in range(50):
    n=res[-1].topk(2)[1]
    n = n[1] if n.data[0]==0 else n[0]
    print(TEXT.vocab.itos[n.data[0]], end=' ')
    res,*_ = m(n[0].unsqueeze(0))
print('...')

Very early observations on the Bills game: Offense still struggling but the 

first - round pick , the first time in the past two years , the team has been in the league for a year . the team has been in the league since the first round of the season . the team has been in the league since the first ...
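
The generation loop above is a form of greedy decoding: it always takes the highest-scoring token, falling back to the runner-up when the best one is `<unk>` (id 0), and feeds the choice back in. The sketch below reproduces that logic with a made-up table of next-token scores in place of the model; every name and number in it is hypothetical.

```python
toy_itos = ['<unk>', 'the', 'team', 'has', 'been', '.']
# Made-up scores: row i holds the scores for the token that follows id i
toy_scores = [
    [0.0, 0.9, 0.1, 0.0, 0.0, 0.0],   # after '<unk>'
    [0.5, 0.0, 0.4, 0.0, 0.0, 0.1],   # after 'the' ('<unk>' scores highest)
    [0.0, 0.0, 0.0, 0.8, 0.1, 0.1],   # after 'team'
    [0.0, 0.0, 0.0, 0.0, 0.9, 0.1],   # after 'has'
    [0.1, 0.2, 0.0, 0.0, 0.0, 0.7],   # after 'been'
    [0.1, 0.8, 0.1, 0.0, 0.0, 0.0],   # after '.'
]

def greedy_generate(start_id, n_tokens, unk_id=0):
    """Repeatedly pick the best next token, skipping unk_id, and feed it back in."""
    out, cur = [], start_id
    for _ in range(n_tokens):
        scores = toy_scores[cur]
        ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
        cur = ranked[1] if ranked[0] == unk_id else ranked[0]
        out.append(toy_itos[cur])
    return out

generated = greedy_generate(start_id=1, n_tokens=6)
print(' '.join(generated))
```

Greedy decoding has no randomness, so once it revisits a context it repeats itself, much like the "the team has been in the league" loop in the real output above. The toy table depends only on the previous token, whereas the real model also carries hidden state.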


### End