## NLP Fastai

To make sense of the NLP chapter, I'm building out the IMDB classifier with all three libraries: fastai, Hugging Face, and PyTorch. A pure Python version doesn't seem to make sense yet, since we haven't gone over embeddings in much detail.

In [1]:
import fastai.text.all as fai_text
import torch

import numpy as np
import pandas as pd

from pathlib import Path
from functools import partial

In [2]:
device = "cuda" if torch.cuda.is_available() else "cpu"
device

'cuda'

In [3]:
path = fai_text.untar_data(fai_text.URLs.IMDB)
path

Path('/home/daynil/.fastai/data/imdb')

In [7]:
files = fai_text.get_text_files(path, folders=['train', 'test', 'unsup'])

In [10]:
txt = files[0].open().read()
txt[:75]

'What could have been an excellent hostage movie was totally ruined by what '

In [11]:
txts = fai_text.L(o.open().read() for o in files[:2000])

Setting up the subword tokenizer first reads and concatenates the corpus of text, writing it to a file in a temporary directory (default `./tmp`).

It then finds the common sequences of characters to create a vocab, e.g. the most frequently occurring sequences of characters get their own token.

Fastai uses Google's tokenizer library [sentencepiece](https://github.com/google/sentencepiece) to do this.

After tokenization, the corpus file is deleted and a tokenizer model and vocab file are created in the temporary directory.
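To get a feel for how "frequent character sequences get their own token" can work, here is a minimal sketch of a byte-pair-style merge loop in plain Python. This is an illustration only and not sentencepiece's implementation (sentencepiece supports several algorithms); the function names and the toy corpus are made up for this example.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across all words and return the most common one."""
    pairs = Counter()
    for symbols in words:
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += 1
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(symbols, pair):
    """Replace every occurrence of `pair` inside one word with the merged symbol."""
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged

def learn_subwords(corpus, n_merges):
    """Start from single characters and repeatedly merge the most frequent pair."""
    # "▁" marks the start of a word, the same marker sentencepiece uses.
    words = [list("▁" + w) for w in corpus.split()]
    for _ in range(n_merges):
        pair = most_frequent_pair(words)
        if pair is None:  # every word is already a single token
            break
        words = [merge_pair(w, pair) for w in words]
    return words

toy_corpus = "the movie was the best movie"
print(learn_subwords(toy_corpus, n_merges=6))
```

With only a few merges, common words collapse into single tokens while rarer words stay split into pieces, which is the same behaviour visible in the real tokenizer output (e.g. `▁direction` followed by `s`).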

In [98]:
sp = fai_text.SubwordTokenizer()
sp.setup(txts)

{'sp_model': Path('tmp/spm.model')}

In [100]:
toks = sp([txt])
" ".join(next(toks))[:75]

'▁What ▁could ▁have ▁been ▁an ▁excellent ▁hostage ▁movie ▁was ▁totally ▁ruin'

Fastai adds its own functionality on top of Google's subword tokenizer. It adds special tokens, like `xxbos` (beginning of stream indicator) and `xxmaj` (the next word was capitalized).

In [104]:
tkn = fai_text.Tokenizer(sp)
# Note coll_repr is literally just printing the first x items of a list
# But makes it easier to work with lists that are possibly generators, so we'll use that
print(fai_text.coll_repr(tkn(txt), 31))

(#234) ['▁xxbos','▁xxmaj','▁what','▁could','▁have','▁been','▁an','▁excellent','▁hostage','▁movie','▁was','▁totally','▁ruined','▁by','▁what','▁apparently','▁looks','▁like','▁a','▁bored','▁director','▁...','▁there','▁were','▁so','▁many','▁direction','s','▁that','▁the','▁movie'...]
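The special tokens are easier to reason about with a toy version of the rules. The sketch below is a simplified stand-in, not fastai's actual rule set: it assumes whitespace-separated words and only handles `xxbos`, `xxmaj`, and `xxup`. The function name is hypothetical.

```python
def toy_special_token_rules(text):
    """Simplified stand-in for fastai's text rules (assumption: whitespace-split words)."""
    out = ["xxbos"]  # mark the beginning of the text
    for word in text.split():
        if len(word) > 1 and word.isupper():
            out += ["xxup", word.lower()]    # the word was written in ALL CAPS
        elif word[0].isupper():
            out += ["xxmaj", word.lower()]   # the word was Capitalized
        else:
            out.append(word)
    return out

print(toy_special_token_rules("What could have been an EXCELLENT movie"))
```

The point of these markers is that the vocab only needs the lowercase form of each word, while the information about capitalization is kept as a separate token.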


Next, we need to numericalize our tokens, which just means replacing each token with its index in the vocab.

We'll use a small sample of 200 instead of the full corpus.

In [110]:
toks200 = txts[:200].map(tkn)
toks200[0][:4]

['▁xxbos', '▁xxmaj', '▁what', '▁could']

In [113]:
num = fai_text.Numericalize()
num.setup(toks200)
fai_text.coll_repr(num.vocab, 20)

'(#2464) [\'xxunk\',\'xxpad\',\'xxbos\',\'xxeos\',\'xxfld\',\'xxrep\',\'xxwrep\',\'xxup\',\'xxmaj\',\'▁xxmaj\',\'▁the\',\'.\',\',\',\'s\',\'▁a\',\'▁of\',\'▁and\',\'▁to\',"\'",\'▁it\'...]'

In [116]:
toks = tkn(txt)
fai_text.coll_repr(toks, 20)

"(#234) ['▁xxbos','▁xxmaj','▁what','▁could','▁have','▁been','▁an','▁excellent','▁hostage','▁movie','▁was','▁totally','▁ruined','▁by','▁what','▁apparently','▁looks','▁like','▁a','▁bored'...]"

In [119]:
nums = num(toks)
fai_text.coll_repr(nums, 20)

'(#234) [TensorText(51),TensorText(9),TensorText(72),TensorText(115),TensorText(44),TensorText(103),TensorText(58),TensorText(700),TensorText(1280),TensorText(28),TensorText(27),TensorText(644),TensorText(0),TensorText(54),TensorText(72),TensorText(1088),TensorText(534),TensorText(55),TensorText(14),TensorText(1881)...]'

In [120]:
print(num.vocab[72], num.vocab[115], num.vocab[44])

▁what ▁could ▁have
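The numericalization step above can be sketched in plain Python. This is a minimal illustration with made-up helper names and a made-up subset of special tokens, not fastai's `Numericalize` itself. It also shows why a token can come back as `0`: anything missing from the vocab (for example because it is too rare) falls back to `xxunk`, which is presumably what happened to `▁ruined` in the output above.

```python
from collections import Counter

SPECIALS = ["xxunk", "xxpad", "xxbos"]  # hypothetical subset of the special tokens

def build_vocab(token_lists, min_freq=1):
    """Special tokens first, then the remaining tokens from most to least frequent."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    frequent = [t for t, c in counts.most_common() if c >= min_freq and t not in SPECIALS]
    return SPECIALS + frequent

def numericalize(tokens, vocab):
    """Replace each token with its index in the vocab; unknown tokens map to 0 (xxunk)."""
    index = {tok: i for i, tok in enumerate(vocab)}
    return [index.get(tok, 0) for tok in tokens]

docs = [["xxbos", "what", "a", "movie"], ["xxbos", "what", "a", "bore"]]
vocab = build_vocab(docs)
ids = numericalize(["xxbos", "what", "a", "film"], vocab)
print(vocab)
print(ids)
```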


Next, we need to set up a way of feeding a large corpus of text into a language model to train it. 

With images, we had to resize each image so that it was a consistent size, e.g. 224x224px. This is because tensors require a regular shape in order to function. However, we cannot simply resize text to whatever length we want.

Training a language model involves (in this case) asking it to predict the *next word* in some text. Importantly, that means *order matters*. 

What we can do is concat the entire corpus into a single text stream, then break it out into a number of batches, where each batch starts where the last one ended.

Using this text as an example:
> In this chapter, we will go back over the example of classifying movie reviews we studied in chapter 1 and dig deeper under the surface. First we will look at the processing steps necessary to convert text into numbers and how to customize it. By doing this, we'll have another example of the PreProcessor used in the data block API.\nThen we will study how we build a language model and train it for a while.

In [121]:
stream = "In this chapter, we will go back over the example of classifying movie reviews we studied in chapter 1 and dig deeper under the surface. First we will look at the processing steps necessary to convert text into numbers and how to customize it. By doing this, we'll have another example of the PreProcessor used in the data block API.\nThen we will study how we build a language model and train it for a while."
tokens = tkn(stream)
bs, seq_len = 6, 15

In [132]:
df = pd.DataFrame(np.array([tokens[i*seq_len : (i+1)*seq_len] for i in range(bs)]))
df

,0,1,2,3,4,5,6,7,8,9,10,11,12,13,14
0,▁xxbos,▁xxmaj,▁in,▁this,▁chapter,",",▁we,▁will,▁go,▁back,▁over,▁the,▁example,▁of,▁class
1,ifying,▁movie,▁reviews,▁we,▁studi,ed,▁in,▁chapter,▁1,▁and,▁dig,▁deep,er,▁under,▁the
2,▁surface,.,▁xxmaj,▁first,▁we,▁will,▁look,▁at,▁the,▁process,ing,▁steps,▁necessary,▁to,▁convert
3,▁text,▁into,▁numbers,▁and,▁how,▁to,▁custom,ize,▁it,.,▁xxmaj,▁by,▁doing,▁this,","
4,▁we,',ll,▁have,▁another,▁example,▁of,▁the,▁pre,pro,ce,s,s,or,▁used
5,▁in,▁the,▁da,ta,▁block,▁xxup,▁a,p,i,.,▁xxmaj,▁then,▁we,▁will,▁study


Now we have 6 rows of text (one per item in the batch) **where the order is preserved**, so the data is in the format we need to feed it into a model.

However, one further wrinkle is that for a realistic corpus like IMDB reviews, this would be millions of columns wide, not just 15, even if we had a much larger batch size like 64.

To solve this, we can create a left-to-right sliding window of mini-streams of data. This still **preserves the order**, but allows us to more tightly control the size of each sample.

In [136]:
bs, seq_len = 6, 5
df = pd.DataFrame(np.array([tokens[i*15 : i*15+seq_len] for i in range(bs)]))
print("First batch of text")
df

First batch of text


,0,1,2,3,4
0,▁xxbos,▁xxmaj,▁in,▁this,▁chapter
1,ifying,▁movie,▁reviews,▁we,▁studi
2,▁surface,.,▁xxmaj,▁first,▁we
3,▁text,▁into,▁numbers,▁and,▁how
4,▁we,',ll,▁have,▁another
5,▁in,▁the,▁da,ta,▁block


In [137]:
bs, seq_len = 6, 5
df = pd.DataFrame(np.array([tokens[i*15+seq_len : i*15+2*seq_len] for i in range(bs)]))
print("Second batch of text")
df

Second batch of text


,0,1,2,3,4
0,",",▁we,▁will,▁go,▁back
1,ed,▁in,▁chapter,▁1,▁and
2,▁will,▁look,▁at,▁the,▁process
3,▁to,▁custom,ize,▁it,.
4,▁example,▁of,▁the,▁pre,pro
5,▁xxup,▁a,p,i,.


In [138]:
bs, seq_len = 6, 5
df = pd.DataFrame(np.array([tokens[i*15+2*seq_len : i*15+3*seq_len] for i in range(bs)]))
print("Third batch of text")
df

Third batch of text


,0,1,2,3,4
0,▁over,▁the,▁example,▁of,▁class
1,▁dig,▁deep,er,▁under,▁the
2,ing,▁steps,▁necessary,▁to,▁convert
3,▁xxmaj,▁by,▁doing,▁this,","
4,ce,s,s,or,▁used
5,▁xxmaj,▁then,▁we,▁will,▁study


Applying this process to the IMDB reviews dataset, we can create a stream by combining the individual documents (each document is a text file with a single review).

For more efficient training, we can randomize the order in which the documents are combined into a stream on each epoch. **Importantly, we randomize the order of the documents, not the order of the text within them**.

Once we have a stream each epoch, we cut that stream into a batch of fixed-size *consecutive* mini-streams. The model then reads the mini-streams in order.

This is done behind the scenes by the fastai `LMDataLoader`. Here, without us specifying either, we get a batch size of 64 and a sequence length of 72.

In [139]:
nums200 = toks200.map(num)
dl = fai_text.LMDataLoader(nums200)
x,y = fai_text.first(dl)
x.shape, y.shape

(torch.Size([64, 72]), torch.Size([64, 72]))

In [143]:
# The independent variable is just the start of the text
print(' '.join(num.vocab[o] for o in x[0][:10]))
# And the label is the same thing, but offset by 1 token
# In other words, we want our model to guess the next token, in this case "▁was"
print(' '.join(num.vocab[o] for o in y[0][:10]))

▁xxbos ▁xxmaj ▁what ▁could ▁have ▁been ▁an ▁excellent ▁hostage ▁movie
▁xxmaj ▁what ▁could ▁have ▁been ▁an ▁excellent ▁hostage ▁movie ▁was
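Putting the pieces together, the batching scheme described above (one long stream, split into `bs` contiguous rows, then read in windows of `seq_len` with the target shifted by one token) can be sketched as follows. This is a simplified illustration under my own assumptions, not the `LMDataLoader` implementation; the function name is hypothetical and there is no shuffling of documents here.

```python
def lm_batches(token_ids, bs, seq_len):
    """Yield (x, y) batches from one long stream; y is x shifted by one token."""
    # Split the stream into `bs` contiguous rows, dropping any leftover tokens.
    # Each row keeps one extra token so the last input still has a target.
    row_len = (len(token_ids) - 1) // bs
    rows = [token_ids[i * row_len : (i + 1) * row_len + 1] for i in range(bs)]
    # Slide a window of `seq_len` tokens left to right across every row at once.
    # The final window may be shorter than `seq_len`.
    for start in range(0, row_len, seq_len):
        end = min(start + seq_len, row_len)
        x = [row[start:end] for row in rows]
        y = [row[start + 1 : end + 1] for row in rows]
        yield x, y

stream = list(range(25))  # stand-in for a numericalized token stream
for x, y in lm_batches(stream, bs=2, seq_len=5):
    print(x, y)
```

Row `i` of each batch continues exactly where row `i` of the previous batch stopped, which is what lets a recurrent model carry its state from one batch to the next.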


Now that we've seen the individual pieces, we can create a dataloader for the full dataset.

In [5]:
get_imdb = partial(fai_text.get_text_files, folders=['train', 'test', 'unsup'])

dls_lm = fai_text.DataBlock(
    blocks=fai_text.TextBlock.from_folder(path, is_lm=True),
    get_items=get_imdb,
    splitter=fai_text.RandomSplitter(0.1)
).dataloaders(path, path=path, bs=128, seq_len=80)


In [6]:
dls_lm.show_batch(max_n=2)

,text,text_
0,"xxbos xxmaj kurt xxmaj russell is so believable and the action so non - stop that it takes thinking about it afterward to realize that there were honest - to - goodness important themes [ overcoming fear of xxmaj the xxmaj stranger , learning to rise above early conditioning , the strength that love and friendship can bring , etc . ] in the storyline . xxmaj this is so very rare for a ' guy 's action flick '","xxmaj kurt xxmaj russell is so believable and the action so non - stop that it takes thinking about it afterward to realize that there were honest - to - goodness important themes [ overcoming fear of xxmaj the xxmaj stranger , learning to rise above early conditioning , the strength that love and friendship can bring , etc . ] in the storyline . xxmaj this is so very rare for a ' guy 's action flick ' that"
1,content to preside over a sham of a hearing ! xxmaj this is just the sort of film they should show young law students in order to elicit a few laughs at all the histrionics . \n\n xxmaj believe me that there are thousands of better films out there from the 1930s waiting to be discovered . xxmaj try almost xxup any film of the era and you 're bound to be better off than with this silly dud .,to preside over a sham of a hearing ! xxmaj this is just the sort of film they should show young law students in order to elicit a few laughs at all the histrionics . \n\n xxmaj believe me that there are thousands of better films out there from the 1930s waiting to be discovered . xxmaj try almost xxup any film of the era and you 're bound to be better off than with this silly dud . xxmaj


Jeremy gets pretty handwavy at this point, saying we'll learn to build this model (a recurrent neural network architecture called AWD-LSTM) from scratch in a later chapter, so I will avoid spending too long diving into details here.

This part of the process also converts the integer word indices into activations by using embeddings, which is what ultimately gets fed into the model.
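As a reminder of what "converting indices into activations" means, here is a minimal pure-Python sketch of an embedding lookup. The sizes and random weights are made up for illustration; a real model would use something like an embedding layer with learned weights.

```python
import random

random.seed(0)
vocab_size, emb_dim = 6, 3  # hypothetical sizes, far smaller than a real model
# One randomly initialized row of weights per token in the vocab
embedding = [[random.uniform(-1, 1) for _ in range(emb_dim)] for _ in range(vocab_size)]

def embed(token_ids):
    """Look up one vector per token index."""
    return [embedding[i] for i in token_ids]

def one_hot_matmul(token_id):
    """The same result, written as a one-hot vector multiplied by the embedding matrix."""
    one_hot = [1.0 if j == token_id else 0.0 for j in range(vocab_size)]
    return [sum(one_hot[j] * embedding[j][k] for j in range(vocab_size))
            for k in range(emb_dim)]

vectors = embed([2, 0, 5])
print(vectors[0])
print(one_hot_matmul(2))
```

The lookup is just a shortcut for the matrix multiplication, so the rows of the table can be trained with gradients like any other weights.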

We're using a language model pretrained on Wikipedia text, then fine-tuning it on our IMDB corpus. The book skips over where exactly the Wikipedia model comes from, but looking into the library code for `language_model_learner`, it defaults to a pretrained model called `WT103_FWD`, which I assume refers to the WikiText-103 dataset, forward direction (they have a backward one as well, wherein predictions are made by shifting the text in the opposite direction).

The default loss function is cross entropy, since this is essentially a classification task (next word prediction probabilities for each of the words in the vocab). `Perplexity` is a metric often used in NLP for language models, which is just the exponential of the loss: `torch.exp(cross_entropy)`.
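To make the relationship concrete, here is a small sketch using only the standard library. The helper names are my own, and the input is simply the probability the model assigned to each correct next token. One way to read perplexity: a model that is equally unsure between 4 tokens at every step has a perplexity of about 4.

```python
import math

def cross_entropy(probs_of_correct_token):
    """Mean negative log-likelihood of the correct next tokens."""
    return -sum(math.log(p) for p in probs_of_correct_token) / len(probs_of_correct_token)

def perplexity(probs_of_correct_token):
    """Perplexity is the exponential of the cross-entropy loss."""
    return math.exp(cross_entropy(probs_of_correct_token))

# A model that spreads its probability evenly over 4 tokens gives each one 0.25.
uniform = [0.25, 0.25, 0.25]
print(perplexity(uniform))

# Sanity check against the first training run in this notebook:
# valid_loss 3.900053 is reported next to perplexity 49.405045.
print(math.exp(3.900053))
```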

Like the vision learner, the language learner automatically calls `freeze` when using a pretrained model (the default), so this only trains the randomly initialized embeddings, i.e. those for tokens that exist in the IMDB vocab but not in the pretrained Wikipedia vocab.

Jeremy says we use `fit_one_cycle` instead of `fine_tune` because `fine_tune` doesn't save intermediate model results during training. Essentially, `fine_tune` automatically does what we have below: it trains one epoch on the frozen model, unfreezes the model, then trains for more epochs. In order to save intermediate results, we split this out ourselves: call `fit_one_cycle` for an epoch, save the model, unfreeze, then call `fit_one_cycle` again for more epochs.

In [7]:
learn = fai_text.language_model_learner(
    dls_lm, fai_text.AWD_LSTM, drop_mult=0.3, 
    metrics=[fai_text.accuracy, fai_text.Perplexity()]).to_fp16()

In [8]:
learn.fit_one_cycle(1, 2e-2)

epoch,train_loss,valid_loss,accuracy,perplexity,time
0,4.007815,3.900053,0.300463,49.405045,50:08


In [11]:
# learn.save('imdb_1epoch')
learn = learn.load('imdb_1epoch')

Once the first epoch has completed training on a completely frozen encoder (the body of the model, which is the whole model without the head), we unfreeze the whole thing and finish fine tuning the model, this time with a lower learning rate.

Since my training was so slow, I'm just going to do 5 epochs. Should take most of the work day and be done. He had demoed with 10 epochs. I'll save after 5 epochs and could do 5 more epochs later if I wanted to.

In [12]:
learn.unfreeze()
learn.fit_one_cycle(5, 2e-3)

epoch,train_loss,valid_loss,accuracy,perplexity,time
0,3.742926,3.763016,0.315783,43.078175,50:22
1,3.661073,3.662944,0.327573,38.975933,47:28
2,3.546125,3.60345,0.33474,36.724709,1:02:30
3,3.440744,3.570328,0.339421,35.528236,1:14:31
4,3.359791,3.569659,0.339821,35.504482,54:49


In [14]:
# Save the entire model (this allows us to do text-generation,
# which is effectively what a next-token prediction model is trained to do)
learn.save('imdb_full')
# And save just the encoder, which we use for the sentiment classification
learn.save_encoder('imdb_encoder')

## Text Generation

Now that we have a fine-tuned language model, we can use it to do next-token generation. To do this, we actually keep the head of our fine tuned model, because that's exactly what it was trained to do.

Essentially, what we have here is an expert (and unfiltered) IMDB review generator. This is really cool because it is basically a tiny ChatGPT, fine-tuned on the IMDB corpus for movie reviews. The key difference is that we don't have an RLHF (reinforcement learning from human feedback) step to follow up, so we can't just ask it questions and expect good answers the way we can with ChatGPT. We need to carefully write prompts which will produce text that logically follows.

Ultimately though, with good prompt engineering, we basically have a fine-tuned GPT-style model (though the original model is tiny and only trained on the Wikipedia corpus).

Note that the next word is picked at random based on the probabilities returned by the model for the vocab, and temperature controls how random that pick is. The higher the temp, the more "creative" the model, in the sense that it has a higher chance of selecting words other than those with the highest probabilities. It effectively increases the randomness. With a low temp, the model produces "cold", rational answers in the sense that it almost always selects the highest-probability word, so it effectively reduces the randomness.

The formula is basically $\text{tempPreds} = \text{preds} / \text{temp}$, applied to the model's scores before they are normalized into probabilities, so a higher temp will increase the probabilities of all the lower-probability words relative to the highest-probability words.
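Here is a minimal sketch of that formula in plain Python, assuming the temperature is applied to raw scores (logits) before a softmax. The scores are hypothetical, and exactly how fastai applies the temperature inside `predict` is not shown here.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Divide the raw scores by the temperature, then normalize into probabilities."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]  # hypothetical scores for a 3-token vocab
for t in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Lower temperatures concentrate the probability on the top-scoring token, while higher temperatures flatten the distribution, so sampling from it picks the less likely tokens more often.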

In [15]:
prompt = "I liked this movie because"
words_per_review = 40
total_reviews = 2
preds = [
    learn.predict(prompt, words_per_review, temperature=0.75) 
    for _ in range(total_reviews)
]

In [17]:
preds

['i liked this movie because Johnny was a middle aged man and that he was a iv to Max ( he was a dark knight in the end ) . His relationship with King is very good because it was powerful',
 "i liked this movie because of it , but to me was painful to watch . This movie is very bad . i saw the movie when i was in school , started to watch it , and was glad i did n't ."]

In [18]:
learn.predict(
    "This fucking movie sucked because",
    words_per_review,
    temperature=0.75
)

'This xxunk movie sucked because of the high - schoolers , who were the ones who were allowed to be it . It is a very good story , but i hate it . Not so bad , i suppose . But'

In [23]:
learn.predict(
    "I really enjoyed this movie because",
    words_per_review,
    temperature=0.5
)

'i really enjoyed this movie because it was so good . It was one of the best movies i have ever seen . The acting was very good , and the actors were great . And i was really surprised to see that'

If we think through the implications of this, it is quite fascinating, because this is essentially how ChatGPT was trained (again, other than RLHF).

We did not explicitly teach the model English. We did not teach spelling, sentence structure, or grammar. And yet, the model has effectively learned to speak English, with an IMDB review style dialect.

The only thing we did is initialize a bunch of **random** parameters (an embedding for each word in the vocab), and by **self-supervised** learning, predicting the next word in a bunch of text (or, as with masked language models such as BERT, predicting words for blanks in text), these learned parameters have become so effective that we've basically **produced a function that can write in English, and to a certain degree, even reason**.

What we have now is a language model that can speak Wikipedia-style English with an IMDB review dialect, but we can easily find models online like GPT-2 or some other ones that are open and much larger that can even translate between languages and write code. Then, we can fine-tune them to any dialect we want (a corpus of documentation text, or even the writing style of any individual).

Then, we can further fine tune these extremely effective models into very tightly controlled functions that do a specific thing really well, like classify IMDB reviews as positive or negative. For specific tasks like this, GPT-3 and up are probably absolute overkill anyway, and inference time would be dramatically faster on these smaller models.

## Classification

Finally, we can use our fine-tuned language model to create a fine-tuned classifier of IMDB reviews as positive or negative. For this, we do need labeled reviews (so this is not self-supervised).

This makes the training process more similar to what we've done for our other training tasks - get a bunch of data with labels, and feed the data and the labels to the model to adjust its parameters.

A couple of differences from a fastai perspective of how we train a language model vs. using a language model to train a classifier:
* We do not pass `is_lm=True` to `TextBlock.from_folder`.
* We pass the vocab we created for the language model.

`is_lm=False` (the default) just tells fastai that we're using our own labels, rather than setting up a self-supervised next-word prediction dataloader.

We also need to use the same vocab, otherwise the embeddings the model learned in the language model phase will make no sense to this classification model.

In [24]:
dls_clas = fai_text.DataBlock(
    blocks=(fai_text.TextBlock.from_folder(path, vocab=dls_lm.vocab),
            fai_text.CategoryBlock),
    get_y = fai_text.parent_label,
    get_items=partial(fai_text.get_text_files, folders=['train', 'test']),
    splitter=fai_text.GrandparentSplitter(valid_name='test')
).dataloaders(path, path=path, bs=128, seq_len=72)
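The labels and the train/validation split in the cell above both come from the folder layout of the dataset (`train/pos`, `train/neg`, `test/pos`, `test/neg`). The sketch below shows the idea with `pathlib` only; the helper names and file names are hypothetical, and this is not the fastai implementation of `parent_label` or `GrandparentSplitter`.

```python
from pathlib import Path

def parent_name(p):
    """Label = name of the folder the file sits in (e.g. 'pos' or 'neg')."""
    return Path(p).parent.name

def grandparent_name(p):
    """Split = name of the folder two levels up (e.g. 'train' or 'test')."""
    return Path(p).parent.parent.name

# Hypothetical file names; only the folder layout matters here.
example_files = ["imdb/train/pos/a.txt", "imdb/train/neg/b.txt", "imdb/test/pos/c.txt"]
train = [(f, parent_name(f)) for f in example_files if grandparent_name(f) != "test"]
valid = [(f, parent_name(f)) for f in example_files if grandparent_name(f) == "test"]
print(train)
print(valid)
```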

In [25]:
dls_clas.show_batch(max_n=3)

,text,category
0,"xxbos xxmaj match 1 : xxmaj tag xxmaj team xxmaj table xxmaj match xxmaj bubba xxmaj ray and xxmaj spike xxmaj dudley vs xxmaj eddie xxmaj guerrero and xxmaj chris xxmaj benoit xxmaj bubba xxmaj ray and xxmaj spike xxmaj dudley started things off with a xxmaj tag xxmaj team xxmaj table xxmaj match against xxmaj eddie xxmaj guerrero and xxmaj chris xxmaj benoit . xxmaj according to the rules of the match , both opponents have to go through tables in order to get the win . xxmaj benoit and xxmaj guerrero heated up early on by taking turns hammering first xxmaj spike and then xxmaj bubba xxmaj ray . a xxmaj german xxunk by xxmaj benoit to xxmaj bubba took the wind out of the xxmaj dudley brother . xxmaj spike tried to help his brother , but the referee restrained him while xxmaj benoit and xxmaj guerrero",pos
1,"xxbos * * attention xxmaj spoilers * * \n\n xxmaj first of all , let me say that xxmaj rob xxmaj roy is one of the best films of the 90 's . xxmaj it was an amazing achievement for all those involved , especially the acting of xxmaj liam xxmaj neeson , xxmaj jessica xxmaj lange , xxmaj john xxmaj hurt , xxmaj brian xxmaj cox , and xxmaj tim xxmaj roth . xxmaj michael xxmaj canton xxmaj jones painted a wonderful portrait of the honor and dishonor that men can represent in themselves . xxmaj but alas … \n\n it constantly , and unfairly gets compared to "" braveheart "" . xxmaj these are two entirely different films , probably only similar in the fact that they are both about xxmaj scots in historical xxmaj scotland . xxmaj yet , this comparison frequently bothers me because it seems",pos
2,"xxbos xxrep 3 * xxmaj warning - this review contains "" plot spoilers , "" though nothing could "" spoil "" this movie any more than it already is . xxmaj it really xxup is that bad . xxrep 3 * \n\n xxmaj before i begin , xxmaj i 'd like to let everyone know that this definitely is one of those so - incredibly - bad - that - you - fall - over - laughing movies . xxmaj if you 're in a lighthearted mood and need a very hearty laugh , this is the movie for you . xxmaj now without further ado , my review : \n\n xxmaj this movie was found in a bargain bin at wal - mart . xxmaj that should be the first clue as to how good of a movie it is . xxmaj secondly , it stars the lame action",neg


## Padding Documents

The final step to fine tuning the classifier is making sure each document loaded into a batch is the same size. Since each document is a different length, we need to find a way to make them match up.

With images, we did this with resizing (squishing), cropping, or padding. With text, we can do this with padding. We add a special token which the model ignores to shorter texts to make them the same size. For performance, we can collect the texts into batches of similar size (so there is less to pad) and use the longest text as the target length.

This is done by the `TextBlock` automatically when `is_lm=False`.
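A minimal sketch of the "sort by length, then pad to the longest text in the batch" idea is below. It is an illustration under simplifying assumptions: the function name is made up, padding is added at the end, and the sort is exact, whereas a real data loader may pad at the front and add some randomness to the ordering.

```python
PAD = 1  # index of 'xxpad' in the vocab printed earlier

def padded_batches(docs, bs):
    """Sort documents by length, then pad each batch to its own longest document."""
    ordered = sorted(docs, key=len)
    for i in range(0, len(ordered), bs):
        batch = ordered[i : i + bs]
        target = max(len(d) for d in batch)
        # Padding goes at the end here for simplicity.
        yield [d + [PAD] * (target - len(d)) for d in batch]

docs = [[5, 6, 7, 8, 9, 10], [5, 6], [5, 6, 7], [5, 6, 7, 8, 9]]
for batch in padded_batches(docs, bs=2):
    print(batch)
```

Because similar lengths end up in the same batch, the short reviews are not padded out to the length of the longest review in the whole dataset.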

In [26]:
learn = fai_text.text_classifier_learner(dls_clas, fai_text.AWD_LSTM, drop_mult=0.5,
                                        metrics=fai_text.accuracy).to_fp16()

In [27]:
learn = learn.load_encoder('imdb_encoder')

## Fine Tuning the Classifier

Fastai indicates the best way to fine-tune an NLP classifier is to use *gradual unfreezing* (in computer vision, we often unfreeze the model all at once), and we also use discriminative learning rates (i.e., we reduce the learning rate as we unfreeze and give the earlier layers lower rates than the later ones, since the earlier layers are more fundamental and less related to the task).
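The `slice(lo, hi)` learning rates used in the cells of this section are how the discriminative rates get specified. My understanding is that the first layer group gets `lo`, the last gets `hi`, and the groups in between are spaced geometrically; the sketch below reproduces that arithmetic in plain Python. The function name is hypothetical, and the number of parameter groups (5) is an assumption about how the model is split.

```python
def spread_learning_rates(lo, hi, n_groups):
    """Geometrically space learning rates from `lo` (earliest group) to `hi` (last group)."""
    if n_groups == 1:
        return [hi]
    step = (hi / lo) ** (1 / (n_groups - 1))
    return [lo * step**i for i in range(n_groups)]

hi = 1e-2
lo = hi / (2.6 ** 4)
# Assumption: the classifier is split into 5 parameter groups.
for i, lr in enumerate(spread_learning_rates(lo, hi, n_groups=5)):
    print(f"group {i}: lr = {lr:.6f}")
```

Under that assumption, dividing by `2.6**4` means each group trains with a learning rate 2.6 times smaller than the group after it.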

In [28]:
learn.fit_one_cycle(1, 2e-2)

epoch,train_loss,valid_loss,accuracy,time
0,0.368773,0.320571,0.867,09:37


Note: Skipped the remaining training so I could experiment, could come back later.

In [None]:
# This unfreezes just the last 2 parameter groups, i.e. freezes all but the last 2
learn.freeze_to(-2)
learn.fit_one_cycle(1, slice(1e-2/(2.6**4), 1e-2))

In [None]:
learn.freeze_to(-3)
learn.fit_one_cycle(1, slice(5e-3/(2.6**4),5e-3))

In [None]:
learn.unfreeze()
learn.fit_one_cycle(2, slice(1e-3/(2.6**4),1e-3))

In [36]:
learn.predict("This was a bad movie, I did not like it, the story was poorly written")

('neg', tensor(0), tensor([0.9347, 0.0653]))

In [37]:
learn.predict("""The characterisations here are definitively subpar. No one looks or sounds like the people they portray, except for Laura Aikman, who looks quite a bit like and acts very much like her part. Therein lies the rub. Far too much effort to get that right that everything else fell by the wayside. The title is a misnomer, to say the least. It's not about Archie, it's not about Cary. It's about Dyan. It should have been called "Dyan, Me, Me, Me and the 6 years I spent with that guy to whom I've served up a narcissistic manipulation pie with a light sprinkle of powdered truth".

He said it best, everything is a confrontation to her. And every confrontation in this lifetime movie of the week is a character assassination for every acquaintance she makes. Everything that goes wrong is always someone else's fault. He does drugs, he sold the dog, he's overbearing and controlling, his mother is a beach, his biz partner is a time stealer. Despite all her obvious flaws, the production makes everyone else out to be the bad guy and poor misunderstood her. Every scene is manicured and curated to paint everyone else in a bad light, and even when you think that maybe there's a bit of balance here, it quickly turns to self-victimizing pandering, expecting the audience to be too stupid to see it.

An excellent study into the true character of a bitter ex-lover, but very little in the way of the person for whom we were duped into thinking it was about. It might think it's subtle, but it's not.""")

('neg', tensor(0), tensor([0.8711, 0.1289]))

### Fastai Summary

All in one place for ease of reference.

## NLP Hugging Face Transformers