## NLP Fastai

To make sense of the NLP chapter, I'm building out the IMDB classifier using all three libraries: fastai, Hugging Face, and PyTorch. A pure Python version doesn't make sense yet, since we haven't covered embeddings in much depth.

In [127]:
import fastai.text.all as fai_text
import torch

import numpy as np
import pandas as pd

from pathlib import Path

In [3]:
device = "cuda" if torch.cuda.is_available() else "cpu"
device

'cuda'

In [4]:
path = fai_text.untar_data(fai_text.URLs.IMDB)
path

Path('/home/daynil/.fastai/data/imdb')

In [7]:
files = fai_text.get_text_files(path, folders=['train', 'test', 'unsup'])

In [10]:
txt = files[0].open().read()
txt[:75]

'What could have been an excellent hostage movie was totally ruined by what '

In [11]:
txts = fai_text.L(o.open().read() for o in files[:2000])

`SubwordTokenizer.setup` reads and concatenates the corpus of text, writing it to a file in a temporary directory (default `./tmp`).

It then finds the common sequences of characters to create a vocab: the most frequently occurring sequences of characters get their own token.

Fastai uses Google's tokenizer library [sentencepiece](https://github.com/google/sentencepiece) to do this.

Once the tokenizer has been trained, the corpus file is deleted, and a tokenizer model and vocab file are left in the temporary directory.
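To get a feel for how "frequent character sequences get their own token", here is a toy sketch of a byte-pair-style merge loop in plain Python. This is an illustration only, not what sentencepiece does internally (it supports several algorithms, and this is a simplified stand-in). The function names `most_frequent_pair` and `merge_pair` and the tiny corpus are made up for this example.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words and return the most common one."""
    pairs = Counter()
    for symbols in words:
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += 1
    return pairs.most_common(1)[0][0]

def merge_pair(words, pair):
    """Replace every adjacent occurrence of `pair` with a single merged symbol."""
    merged = []
    for symbols in words:
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged.append(out)
    return merged

# Hypothetical toy corpus: each word starts out as a list of single characters.
corpus = ["lower", "lowest", "low", "slow"]
words = [list(w) for w in corpus]

for _ in range(2):  # two merge rounds are enough to fuse "l", "o", "w" into "low"
    words = merge_pair(words, most_frequent_pair(words))

print(words)
```

After two rounds the sequence `low`, which appears in every word, has become a single symbol, while rarer endings like `e`, `r`, `s`, `t` are still individual characters.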

In [98]:
sp = fai_text.SubwordTokenizer()
sp.setup(txts)

{'sp_model': Path('tmp/spm.model')}

In [100]:
toks = sp([txt])
" ".join(next(toks))[:75]

'▁What ▁could ▁have ▁been ▁an ▁excellent ▁hostage ▁movie ▁was ▁totally ▁ruin'

Fastai adds its own functionality on top of Google's subword tokenizer. It adds special tokens such as `xxbos` (which marks the beginning of a stream) and `xxmaj` (which marks that the next word was capitalized).

In [104]:
tkn = fai_text.Tokenizer(sp)
# Note coll_repr is literally just printing the first x items of a list
# But makes it easier to work with lists that are possibly generators, so we'll use that
print(fai_text.coll_repr(tkn(txt), 31))

(#234) ['▁xxbos','▁xxmaj','▁what','▁could','▁have','▁been','▁an','▁excellent','▁hostage','▁movie','▁was','▁totally','▁ruined','▁by','▁what','▁apparently','▁looks','▁like','▁a','▁bored','▁director','▁...','▁there','▁were','▁so','▁many','▁direction','s','▁that','▁the','▁movie'...]
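The output above shows `xxbos` at the start and `xxmaj` in front of the lowercased "what". A rough sketch of that kind of preprocessing rule is below. It is a simplified imitation for intuition, not fastai's actual rules: it ignores other special tokens such as `xxup`, and the function name `add_special_tokens` and the regex are assumptions of this example.

```python
import re

BOS, MAJ = "xxbos", "xxmaj"  # special-token names as they appear in the output above

def add_special_tokens(text):
    """Simplified stand-in: mark capitalized words, lowercase everything, prepend BOS."""
    def mark_capital(match):
        return f"{MAJ} {match.group(0).lower()}"
    # Matches a word made of one uppercase letter followed by lowercase letters only
    marked = re.sub(r"\b[A-Z][a-z]+\b", mark_capital, text)
    return f"{BOS} {marked.lower()}"

print(add_special_tokens("What could have been an excellent hostage movie"))
```

The idea is that the model sees a single lowercase form of each word, with capitalization carried by a separate token instead of doubling the vocab.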


Next, we need to numericalize our tokens, which just means replacing each token with its index in the vocab.

We'll use a small sample of 200 instead of the full corpus.

In [110]:
toks200 = txts[:200].map(tkn)
toks200[0][:4]

['▁xxbos', '▁xxmaj', '▁what', '▁could']

In [113]:
num = fai_text.Numericalize()
num.setup(toks200)
fai_text.coll_repr(num.vocab, 20)

'(#2464) [\'xxunk\',\'xxpad\',\'xxbos\',\'xxeos\',\'xxfld\',\'xxrep\',\'xxwrep\',\'xxup\',\'xxmaj\',\'▁xxmaj\',\'▁the\',\'.\',\',\',\'s\',\'▁a\',\'▁of\',\'▁and\',\'▁to\',"\'",\'▁it\'...]'

In [116]:
toks = tkn(txt)
fai_text.coll_repr(toks, 20)

"(#234) ['▁xxbos','▁xxmaj','▁what','▁could','▁have','▁been','▁an','▁excellent','▁hostage','▁movie','▁was','▁totally','▁ruined','▁by','▁what','▁apparently','▁looks','▁like','▁a','▁bored'...]"

In [119]:
nums = num(toks)
fai_text.coll_repr(nums, 20)

'(#234) [TensorText(51),TensorText(9),TensorText(72),TensorText(115),TensorText(44),TensorText(103),TensorText(58),TensorText(700),TensorText(1280),TensorText(28),TensorText(27),TensorText(644),TensorText(0),TensorText(54),TensorText(72),TensorText(1088),TensorText(534),TensorText(55),TensorText(14),TensorText(1881)...]'

In [120]:
print(num.vocab[72], num.vocab[115], num.vocab[44])

▁what ▁could ▁have
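Numericalization itself is simple enough to sketch in plain Python. The version below is a minimal illustration, not fastai's `Numericalize`: the helper names `build_vocab` and `numericalize`, the shortened list of special tokens, and the `min_freq=2` threshold are choices made for this example. It also shows why `▁ruined` came out as `0` above: index 0 is `xxunk`, and tokens that are too rare in the sample used to build the vocab fall back to it.

```python
from collections import Counter

def build_vocab(token_lists, specials=("xxunk", "xxpad", "xxbos"), min_freq=2):
    """Special tokens first, then tokens seen at least `min_freq` times, most frequent first."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    frequent = [tok for tok, n in counts.most_common()
                if n >= min_freq and tok not in specials]
    return list(specials) + frequent

def numericalize(tokens, vocab):
    """Map each token to its vocab index; unseen or rare tokens map to index 0 (xxunk)."""
    index = {tok: i for i, tok in enumerate(vocab)}
    return [index.get(tok, 0) for tok in tokens]

# Hypothetical two-document corpus, already tokenized
docs = [["xxbos", "the", "movie", "was", "good"],
        ["xxbos", "the", "movie", "was", "ruined"]]

vocab = build_vocab(docs)
ids = numericalize(docs[1], vocab)
print(vocab)
print(ids)
```

Here "good" and "ruined" each appear only once, so they are left out of the vocab and both map to `xxunk`.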


Next, we need to set up a way of feeding a large corpus of text into a language model to train it. 

With images, we had to resize each image so that it was a consistent size, e.g. 224x224px. This is because tensors require a regular shape in order to function. However, we cannot simply resize text to whatever length we want.

Training a language model involves (in this case) asking it to predict the *next word* in some text. Importantly, that means *order matters*. 

What we can do is concatenate the entire corpus into a single text stream, then cut it into a number of contiguous pieces, where each piece starts where the last one ended.

Using this text as an example:
> In this chapter, we will go back over the example of classifying movie reviews we studied in chapter 1 and dig deeper under the surface. First we will look at the processing steps necessary to convert text into numbers and how to customize it. By doing this, we'll have another example of the PreProcessor used in the data block API.\nThen we will study how we build a language model and train it for a while.

In [121]:
stream = "In this chapter, we will go back over the example of classifying movie reviews we studied in chapter 1 and dig deeper under the surface. First we will look at the processing steps necessary to convert text into numbers and how to customize it. By doing this, we'll have another example of the PreProcessor used in the data block API.\nThen we will study how we build a language model and train it for a while."
tokens = tkn(stream)
bs, seq_len = 6, 15

In [132]:
df = pd.DataFrame(np.array([tokens[i*seq_len : (i+1)*seq_len] for i in range(bs)]))
df

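| | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ▁xxbos | ▁xxmaj | ▁in | ▁this | ▁chapter | , | ▁we | ▁will | ▁go | ▁back | ▁over | ▁the | ▁example | ▁of | ▁class |
| 1 | ifying | ▁movie | ▁reviews | ▁we | ▁studi | ed | ▁in | ▁chapter | ▁1 | ▁and | ▁dig | ▁deep | er | ▁under | ▁the |
| 2 | ▁surface | . | ▁xxmaj | ▁first | ▁we | ▁will | ▁look | ▁at | ▁the | ▁process | ing | ▁steps | ▁necessary | ▁to | ▁convert |
| 3 | ▁text | ▁into | ▁numbers | ▁and | ▁how | ▁to | ▁custom | ize | ▁it | . | ▁xxmaj | ▁by | ▁doing | ▁this | , |
| 4 | ▁we | ' | ll | ▁have | ▁another | ▁example | ▁of | ▁the | ▁pre | pro | ce | s | s | or | ▁used |
| 5 | ▁in | ▁the | ▁da | ta | ▁block | ▁xxup | ▁a | p | i | . | ▁xxmaj | ▁then | ▁we | ▁will | ▁study |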


Now we have 6 contiguous mini-streams **where the order is preserved**, which is the format we need to feed the data into a model.

However, one further wrinkle is that for a realistic corpus like the IMDB reviews, each row would be millions of tokens wide, not just 15, even with a much larger batch size like 64.

To solve this, we can create a left-to-right sliding window of mini-streams of data. This still **preserves the order**, but allows us to more tightly control the size of each sample.

In [136]:
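bs, seq_len = 6, 5
# Each row i still starts at token i*15 (the row length used above); we only take its first seq_len tokens
df = pd.DataFrame(np.array([tokens[i*15 : i*15+seq_len] for i in range(bs)]))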
print("First batch of text")
df

First batch of text


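| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | ▁xxbos | ▁xxmaj | ▁in | ▁this | ▁chapter |
| 1 | ifying | ▁movie | ▁reviews | ▁we | ▁studi |
| 2 | ▁surface | . | ▁xxmaj | ▁first | ▁we |
| 3 | ▁text | ▁into | ▁numbers | ▁and | ▁how |
| 4 | ▁we | ' | ll | ▁have | ▁another |
| 5 | ▁in | ▁the | ▁da | ta | ▁block |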


In [137]:
bs, seq_len = 6, 5
df = pd.DataFrame(np.array([tokens[i*15+seq_len : i*15+2*seq_len] for i in range(bs)]))
print("Second batch of text")
df

Second batch of text


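| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | , | ▁we | ▁will | ▁go | ▁back |
| 1 | ed | ▁in | ▁chapter | ▁1 | ▁and |
| 2 | ▁will | ▁look | ▁at | ▁the | ▁process |
| 3 | ▁to | ▁custom | ize | ▁it | . |
| 4 | ▁example | ▁of | ▁the | ▁pre | pro |
| 5 | ▁xxup | ▁a | p | i | . |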


In [138]:
bs, seq_len = 6, 5
df = pd.DataFrame(np.array([tokens[i*15+2*seq_len : i*15+3*seq_len] for i in range(bs)]))
print("Third batch of text")
df

Third batch of text


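| | 0 | 1 | 2 | 3 | 4 |
|---|---|---|---|---|---|
| 0 | ▁over | ▁the | ▁example | ▁of | ▁class |
| 1 | ▁dig | ▁deep | er | ▁under | ▁the |
| 2 | ing | ▁steps | ▁necessary | ▁to | ▁convert |
| 3 | ▁xxmaj | ▁by | ▁doing | ▁this | , |
| 4 | ce | s | s | or | ▁used |
| 5 | ▁xxmaj | ▁then | ▁we | ▁will | ▁study |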
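The three cells above repeat the same slicing with a different offset each time. As a sketch, that pattern can be wrapped in one generator. The name `contiguous_batches` and the `t0..t89` placeholder tokens are made up for this example; this is plain Python for intuition, not how fastai implements it.

```python
def contiguous_batches(tokens, bs, seq_len):
    """Split `tokens` into `bs` contiguous rows, then yield windows of `seq_len` columns."""
    row_len = len(tokens) // bs            # tokens per row; any remainder is dropped
    rows = [tokens[i * row_len:(i + 1) * row_len] for i in range(bs)]
    for start in range(0, row_len, seq_len):
        # Row i of this batch continues exactly where row i of the previous batch stopped
        yield [row[start:start + seq_len] for row in rows]

tokens = [f"t{i}" for i in range(90)]      # placeholder stream of 90 tokens (6 rows of 15)
batches = list(contiguous_batches(tokens, bs=6, seq_len=5))

for n, batch in enumerate(batches):
    print(f"batch {n}, row 0: {batch[0]}")
```

Reading row 0 across the three batches gives `t0` through `t14` in order, which is the property the model relies on.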


Applying this process to the IMDB reviews dataset, we can create a stream by combining the individual documents (each document is a text file with a single review).

For more efficient training, we can randomize the order in which the documents are combined into a stream on each epoch. **Importantly, we randomize the order of the documents, not the order of the text within them**.

Once we have a stream each epoch, we cut that stream into a batch of fixed-size *consecutive* mini-streams. The model then reads the mini-streams in order.
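A minimal sketch of the per-epoch shuffle, assuming each document is already a list of tokens. The function name `epoch_stream` and the `seed` parameter are hypothetical and only there to make the example reproducible.

```python
import random

def epoch_stream(docs, seed):
    """Shuffle the order of the documents, then concatenate them into one token stream."""
    order = list(range(len(docs)))
    random.Random(seed).shuffle(order)     # only the document order changes
    return [tok for i in order for tok in docs[i]]

# Hypothetical tokenized documents
docs = [["xxbos", "a", "b"], ["xxbos", "c", "d", "e"], ["xxbos", "f"]]

for epoch in range(2):
    print(" ".join(epoch_stream(docs, seed=epoch)))
```

Whatever order comes out, the tokens inside each document stay in their original sequence.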

This is done behind the scenes by the fastai `LMDataLoader`. Here, it uses its default batch size of 64 and sequence length of 72.

In [139]:
nums200 = toks200.map(num)
dl = fai_text.LMDataLoader(nums200)
x,y = fai_text.first(dl)
x.shape, y.shape

(torch.Size([64, 72]), torch.Size([64, 72]))

In [143]:
# The independent variable is just the start of the text
print(' '.join(num.vocab[o] for o in x[0][:10]))
# And the label is the same thing, but offset by 1 token
# In other words, we want our model to guess the next token at each position; after "▁movie" that is "▁was"
print(' '.join(num.vocab[o] for o in y[0][:10]))

▁xxbos ▁xxmaj ▁what ▁could ▁have ▁been ▁an ▁excellent ▁hostage ▁movie
▁xxmaj ▁what ▁could ▁have ▁been ▁an ▁excellent ▁hostage ▁movie ▁was
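The offset-by-one relationship between `x` and `y` can be reproduced with plain list slicing. This is a sketch on a single stream with a made-up helper `lm_pairs`; it ignores batching and is not the `LMDataLoader` implementation.

```python
def lm_pairs(stream, seq_len):
    """Yield (input, target) windows where the target is the input shifted by one token."""
    # Stop early enough that the target window never runs past the end of the stream
    for start in range(0, len(stream) - seq_len, seq_len):
        x = stream[start:start + seq_len]
        y = stream[start + 1:start + seq_len + 1]
        yield x, y

stream = "xxbos xxmaj what could have been an excellent hostage movie was".split()
pairs = list(lm_pairs(stream, seq_len=5))

for x, y in pairs:
    print("x:", x)
    print("y:", y)
```

Every target token `y[i]` is the token that follows `x[i]` in the stream, so one window gives the model `seq_len` next-token predictions to learn from.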
