In [1]:
#hide
!pip install -Uqq fastbook
import fastbook
fastbook.setup_book()

Mounted at /content/gdrive


In [2]:
#hide
from fastbook import *
from IPython.display import display,HTML

# NLP Deep Dive: RNNs

## Text Preprocessing

### Tokenization

### Word Tokenization with fastai

In [3]:
from fastai.text.all import *
path = untar_data(URLs.IMDB)

In [4]:
files = get_text_files(path, folders = ['train', 'test', 'unsup'])

In [5]:
txt = files[0].open().read(); txt[:75]

"This movie is so bad it's funny. It stars Scott Backula as some coach, but "

In [6]:
spacy = WordTokenizer()
toks = first(spacy([txt]))
print(coll_repr(toks, 30))

(#225) ['This','movie','is','so','bad','it',"'s",'funny','.','It','stars','Scott','Backula','as','some','coach',',','but','that',"'s",'not','important',',','what','is','important','is','the','large','black'...]


In [7]:
first(spacy(['The U.S. dollar $1 is $1.00.']))

(#9) ['The','U.S.','dollar','$','1','is','$','1.00','.']

In [8]:
tkn = Tokenizer(spacy)
print(coll_repr(tkn(txt), 31))

(#242) ['xxbos','xxmaj','this','movie','is','so','bad','it',"'s",'funny','.','xxmaj','it','stars','xxmaj','scott','xxmaj','backula','as','some','coach',',','but','that',"'s",'not','important',',','what','is','important'...]


In [9]:
defaults.text_proc_rules

[<function fastai.text.core.fix_html>,
 <function fastai.text.core.replace_rep>,
 <function fastai.text.core.replace_wrep>,
 <function fastai.text.core.spec_add_spaces>,
 <function fastai.text.core.rm_useless_spaces>,
 <function fastai.text.core.replace_all_caps>,
 <function fastai.text.core.replace_maj>,
 <function fastai.text.core.lowercase>]

In [10]:
coll_repr(tkn('&copy;   Fast.ai www.fast.ai/INDEX'), 31)

"(#11) ['xxbos','©','xxmaj','fast.ai','xxrep','3','w','.fast.ai','/','xxup','index']"
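
The `xxrep 3 w` in the output above comes from the `replace_rep` rule listed in `defaults.text_proc_rules`. To make the idea concrete, here is a minimal sketch using only the standard library. The name `replace_rep_sketch` and the regular expression are assumptions made for illustration; this is not fastai's implementation, only a stand-in for the same idea.

```python
import re

def replace_rep_sketch(text):
    """Replace a character repeated 3+ times with ' xxrep <count> <char> '.

    Hypothetical helper for illustration; not the fastai function itself.
    """
    def _sub(match):
        char, rest = match.group(1), match.group(2)
        # `rest` holds the repeats after the first character
        return f" xxrep {len(rest) + 1} {char} "
    # (\S) captures one non-space character, (\1{2,}) at least two more copies
    return re.sub(r"(\S)(\1{2,})", _sub, text)

print(replace_rep_sketch("www.fast.ai").split())
# ['xxrep', '3', 'w', '.fast.ai']
```

Encoding the repeat as a count plus a character lets the model's embedding matrix handle "a repeated character" as a general concept, rather than needing a separate token for every possible run length.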

### Subword Tokenization

In [11]:
txts = L(o.open().read() for o in files[:2000])

In [12]:
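def subword(sz):
    # Train a subword tokenizer with a vocab of `sz` pieces on the sample reviews,
    # then show the first 40 pieces of the first review.
    sp = SubwordTokenizer(vocab_sz=sz)
    sp.setup(txts)
    return ' '.join(first(sp([txt]))[:40])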

In [13]:
subword(1000)

"▁This ▁movie ▁is ▁so ▁bad ▁it ' s ▁funny . ▁It ▁star s ▁S co t t ▁B ack ul a ▁as ▁some ▁co a ch , ▁but ▁that ' s ▁not ▁im port ant , ▁what ▁is ▁im port"

In [14]:
subword(200)

"▁ T h i s ▁movie ▁is ▁s o ▁b a d ▁it ' s ▁f un n y . ▁I t ▁st ar s ▁S c o t t ▁B a ck u la ▁a s ▁s o m"

In [15]:
subword(10000)

"▁This ▁movie ▁is ▁so ▁bad ▁it ' s ▁funny . ▁It ▁stars ▁Scott ▁Back ula ▁as ▁some ▁coach , ▁but ▁that ' s ▁not ▁important , ▁what ▁is ▁important ▁is ▁the ▁large ▁black ▁fellow ▁who ▁plays ▁1 s t ▁base"
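
The three outputs above show the trade-off: a small vocab splits words into many short pieces, while a large vocab keeps most words whole. The sketch below illustrates that effect with a toy greedy longest-match segmenter. It is an assumption-laden illustration only: `greedy_segment` and the tiny hand-written vocabularies are hypothetical, and this is not the algorithm `SubwordTokenizer` uses to learn or apply its vocab.

```python
def greedy_segment(word, vocab):
    """Split `word` into the longest pieces found in `vocab`, left to right.

    Falls back to single characters when no longer piece matches.
    Toy illustration only; not how SubwordTokenizer segments text.
    """
    pieces, i = [], 0
    while i < len(word):
        # Try the longest candidate first, shrinking down to one character
        for j in range(len(word), i, -1):
            if j == i + 1 or word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
    return pieces

small_vocab = {"▁im", "port", "ant"}   # hypothetical small vocab
large_vocab = {"▁important"}           # hypothetical large vocab

print(greedy_segment("▁important", small_vocab))  # ['▁im', 'port', 'ant']
print(greedy_segment("▁important", large_vocab))  # ['▁important']
```

With fewer pieces per word, sequences get shorter but the embedding matrix gets larger, which is the same trade-off visible between `subword(200)` and `subword(10000)`.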

### Numericalization with fastai

In [16]:
toks = tkn(txt)
print(coll_repr(toks, 31))

(#242) ['xxbos','xxmaj','this','movie','is','so','bad','it',"'s",'funny','.','xxmaj','it','stars','xxmaj','scott','xxmaj','backula','as','some','coach',',','but','that',"'s",'not','important',',','what','is','important'...]


In [17]:
toks200 = txts[:200].map(tkn)
toks200[0]

(#242) ['xxbos','xxmaj','this','movie','is','so','bad','it',"'s",'funny'...]

In [18]:
num = Numericalize()
num.setup(toks200)
coll_repr(num.vocab,20)

"(#1984) ['xxunk','xxpad','xxbos','xxeos','xxfld','xxrep','xxwrep','xxup','xxmaj','the','.',',','a','and','to','of','is','it','i','in'...]"

In [19]:
nums = num(toks)[:20]; nums

TensorText([   2,    8,   20,   27,   16,   49,   72,   17,   22,  170,   10,    8,   17,  634,    8,  806,    8, 1161,   29,   66])

In [20]:
' '.join(num.vocab[o] for o in nums)

"xxbos xxmaj this movie is so bad it 's funny . xxmaj it stars xxmaj scott xxmaj backula as some"
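
Numericalization boils down to two steps: build a vocab from token frequencies, then look up each token's index. Here is a minimal pure-Python sketch of that idea. The names `build_vocab` and `numericalize`, the shortened special-token list, and the `min_freq=2` setting are assumptions for illustration; `Numericalize` has its own defaults and more special tokens.

```python
from collections import Counter

# Shortened list of special tokens, assumed for this sketch
SPECIALS = ["xxunk", "xxpad", "xxbos", "xxeos"]

def build_vocab(token_lists, min_freq=2):
    """Special tokens first, then words seen at least `min_freq` times,
    most frequent first."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    words = [w for w, c in counts.most_common()
             if c >= min_freq and w not in SPECIALS]
    return SPECIALS + words

def numericalize(tokens, vocab):
    """Map each token to its vocab index; unseen tokens map to 'xxunk' (0)."""
    index = {w: i for i, w in enumerate(vocab)}
    return [index.get(t, 0) for t in tokens]

docs = [["xxbos", "this", "movie", "is", "bad"],
        ["xxbos", "this", "movie", "is", "great"]]
vocab = build_vocab(docs)
print(vocab)
print(numericalize(docs[1], vocab))  # 'great' appears once, so it becomes 0
```

Dropping rare words keeps the embedding matrix small, at the cost of replacing those words with the unknown token.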

### Putting Our Texts into Batches for a Language Model

In [21]:
stream = "In this chapter, we will go back over the example of classifying movie reviews we studied in chapter 1 and dig deeper under the surface. First we will look at the processing steps necessary to convert text into numbers and how to customize it. By doing this, we'll have another example of the PreProcessor used in the data block API.\nThen we will study how we build a language model and train it for a while."
tokens = tkn(stream)
bs,seq_len = 6,15
# Cut the token stream into `bs` consecutive rows of `seq_len` tokens each
d_tokens = np.array([tokens[i*seq_len:(i+1)*seq_len] for i in range(bs)])
df = pd.DataFrame(d_tokens)
display(HTML(df.to_html(index=False,header=None)))

0,1,2,3,4,5,6,7,8,9,10,11,12,13,14
xxbos,xxmaj,in,this,chapter,",",we,will,go,back,over,the,example,of,classifying
movie,reviews,we,studied,in,chapter,1,and,dig,deeper,under,the,surface,.,xxmaj
first,we,will,look,at,the,processing,steps,necessary,to,convert,text,into,numbers,and
how,to,customize,it,.,xxmaj,by,doing,this,",",we,'ll,have,another,example
of,the,preprocessor,used,in,the,data,block,xxup,api,.,\n,xxmaj,then,we
will,study,how,we,build,a,language,model,and,train,it,for,a,while,.


In [22]:
bs,seq_len = 6,5
d_tokens = np.array([tokens[i*15:i*15+seq_len] for i in range(bs)])
df = pd.DataFrame(d_tokens)
display(HTML(df.to_html(index=False,header=None)))

0,1,2,3,4
xxbos,xxmaj,in,this,chapter
movie,reviews,we,studied,in
first,we,will,look,at
how,to,customize,it,.
of,the,preprocessor,used,in
will,study,how,we,build


In [23]:
bs,seq_len = 6,5
d_tokens = np.array([tokens[i*15+seq_len:i*15+2*seq_len] for i in range(bs)])
df = pd.DataFrame(d_tokens)
display(HTML(df.to_html(index=False,header=None)))

0,1,2,3,4
",",we,will,go,back
chapter,1,and,dig,deeper
the,processing,steps,necessary,to
xxmaj,by,doing,this,","
the,data,block,xxup,api
a,language,model,and,train


In [24]:
bs,seq_len = 6,5
d_tokens = np.array([tokens[i*15+10:i*15+15] for i in range(bs)])
df = pd.DataFrame(d_tokens)
display(HTML(df.to_html(index=False,header=None)))

0,1,2,3,4
over,the,example,of,classifying
under,the,surface,.,xxmaj
convert,text,into,numbers,and
we,'ll,have,another,example
.,\n,xxmaj,then,we
it,for,a,while,.


In [25]:
nums200 = toks200.map(num)

In [26]:
dl = LMDataLoader(nums200)

In [27]:
x,y = first(dl)
x.shape,y.shape

(torch.Size([64, 72]), torch.Size([64, 72]))

In [28]:
' '.join(num.vocab[o] for o in x[0][:20])

"xxbos xxmaj this movie is so bad it 's funny . xxmaj it stars xxmaj scott xxmaj backula as some"

In [29]:
' '.join(num.vocab[o] for o in y[0][:20])

"xxmaj this movie is so bad it 's funny . xxmaj it stars xxmaj scott xxmaj backula as some coach"
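
Notice that `y` is just `x` shifted by one token. The sketch below reproduces the batching idea from this section on a toy stream of integers: split the stream into `bs` contiguous rows, then walk along the rows `seq_len` tokens at a time. The function `lm_batches` is a hypothetical helper written for illustration; `LMDataLoader` itself handles more details (for example, shuffling the documents before concatenating them).

```python
def lm_batches(ids, bs, seq_len):
    """Yield (x, y) mini-batches for language modeling.

    Row i of every batch continues row i of the previous batch, and
    y is x shifted one token to the left. Illustrative sketch only.
    """
    # Reserve one extra token so the last input has a target
    row_len = (len(ids) - 1) // bs
    rows = [ids[i * row_len:(i + 1) * row_len + 1] for i in range(bs)]
    for start in range(0, row_len, seq_len):
        end = min(start + seq_len, row_len)
        x = [row[start:end] for row in rows]
        y = [row[start + 1:end + 1] for row in rows]
        yield x, y

stream = list(range(13))  # stand-in for a numericalized token stream
for x, y in lm_batches(stream, bs=2, seq_len=3):
    print(x, y)
# [[0, 1, 2], [6, 7, 8]] [[1, 2, 3], [7, 8, 9]]
# [[3, 4, 5], [9, 10, 11]] [[4, 5, 6], [10, 11, 12]]
```

The second row of the first batch starts halfway through the stream, not right after the first row, so that each row reads as continuous text from one batch to the next.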

## Training a Text Classifier

### Language Model Using DataBlock

In [30]:
get_imdb = partial(get_text_files, folders=['train', 'test', 'unsup'])

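# is_lm=True builds language-model targets (the input shifted by one token);
# RandomSplitter(0.1) holds out 10% of the reviews for validation.
dls_lm = DataBlock(
    blocks=TextBlock.from_folder(path, is_lm=True),
    get_items=get_imdb, splitter=RandomSplitter(0.1)
).dataloaders(path, path=path, bs=128, seq_len=80)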

In [31]:
dls_lm.show_batch(max_n=2)

Unnamed: 0,text,text_
0,"xxbos xxmaj the hype surrounding xxmaj apichatpong seems to me unwarranted . i am reminded of xxmaj roger xxmaj ebert 's comments on xxmaj abbas xxmaj kiarostami and being utterly unconvinced of the value of his films . \n\n xxmaj first , there is no story . xxmaj as soon as a story might be emerging , "" joe "" ( as he likes to be called these days ) moves to something utterly unrelated . xxmaj he has said","xxmaj the hype surrounding xxmaj apichatpong seems to me unwarranted . i am reminded of xxmaj roger xxmaj ebert 's comments on xxmaj abbas xxmaj kiarostami and being utterly unconvinced of the value of his films . \n\n xxmaj first , there is no story . xxmaj as soon as a story might be emerging , "" joe "" ( as he likes to be called these days ) moves to something utterly unrelated . xxmaj he has said that"
1,"must be evil because the methods are evil . xxmaj just stop thinking for yourself ; trust in xxmaj god and the xxup fbi . xxmaj it 's a dangerous message , and not all that far off from the one that the xxmaj communists drilled into their victims ( for "" god and the xxup fbi "" substitute "" the xxmaj party "" ) . xxmaj oh , well . . . i guess i could n't avoid the","be evil because the methods are evil . xxmaj just stop thinking for yourself ; trust in xxmaj god and the xxup fbi . xxmaj it 's a dangerous message , and not all that far off from the one that the xxmaj communists drilled into their victims ( for "" god and the xxup fbi "" substitute "" the xxmaj party "" ) . xxmaj oh , well . . . i guess i could n't avoid the political"


### Fine-Tuning the Language Model

In [32]:
learn = language_model_learner(
    dls_lm, AWD_LSTM, drop_mult=0.3, 
    metrics=[accuracy, Perplexity()]).to_fp16()
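
Perplexity is commonly defined as the exponential of the average cross-entropy loss. The sketch below computes it from a list of probabilities that a model assigned to the correct next tokens. The function name `perplexity` and the toy probabilities are assumptions for illustration; this is not the metric class used above.

```python
import math

def perplexity(correct_token_probs):
    """exp(mean negative log-likelihood) over the given probabilities."""
    nll = -sum(math.log(p) for p in correct_token_probs) / len(correct_token_probs)
    return math.exp(nll)

# A model that always gives the right token probability 1/4 is as
# "perplexed" as one choosing uniformly among 4 options.
print(perplexity([0.25, 0.25, 0.25, 0.25]))
```

A lower value means the model is, on average, less surprised by the actual next token.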

In [33]:
learn.fit_one_cycle(1, 2e-2)

epoch,train_loss,valid_loss,accuracy,perplexity,time


RuntimeError: ignored

### Saving and Loading Models

In [34]:
learn.save('1epoch')

Path('/root/.fastai/data/imdb/models/1epoch.pth')

In [35]:
learn = learn.load('1epoch')

In [36]:
learn.unfreeze()
learn.fit_one_cycle(10, 2e-3)

epoch,train_loss,valid_loss,accuracy,perplexity,time


RuntimeError: ignored

In [37]:
learn.save_encoder('finetuned')

### Text Generation

In [38]:
TEXT = "I liked this movie because"
N_WORDS = 40
N_SENTENCES = 2
preds = [learn.predict(TEXT, N_WORDS, temperature=0.75) 
         for _ in range(N_SENTENCES)]

In [39]:
print("\n".join(preds))

i liked this movie because of its style and music , and Roman Catholic folklore and interest in books , and the film End of the Century ( 1995 ) . It now has a number of legal issues and
i liked this movie because it was an adaptation of the novel of the same name . Though the film was not specifically designed to be a Holocaust , the author cited the film as a way to establish Christian Christian
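
The `temperature=0.75` argument controls how adventurous the sampling is. One common formulation divides the model's scores by the temperature before the softmax, so values below 1 sharpen the distribution and values above 1 flatten it. The sketch below shows that formulation with made-up scores; `softmax_with_temperature` and `sample_next` are hypothetical helpers, and the library's internals may be written differently.

```python
import math
import random

def softmax_with_temperature(logits, temperature=1.0):
    """Softmax over logits / temperature (numerically stabilized)."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def sample_next(logits, temperature=0.75, rng=random):
    """Randomly pick the next token index according to the tempered probabilities."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.0]  # made-up scores for three candidate tokens
print(softmax_with_temperature(logits, 0.5))  # sharper: favors the top token more
print(softmax_with_temperature(logits, 2.0))  # flatter: closer to uniform
print(sample_next(logits, rng=random.Random(0)))
```

Because each word is sampled rather than chosen greedily, the two generated sentences above differ even though they start from the same prompt.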


### Creating the Classifier DataLoaders

In [40]:
# Reuse the language model's vocab so token ids line up with the fine-tuned encoder
dls_clas = DataBlock(
    blocks=(TextBlock.from_folder(path, vocab=dls_lm.vocab),CategoryBlock),
    get_y = parent_label,
    get_items=partial(get_text_files, folders=['train', 'test']),
    splitter=GrandparentSplitter(valid_name='test')
).dataloaders(path, path=path, bs=128, seq_len=72)

In [41]:
dls_clas.show_batch(max_n=3)

Unnamed: 0,text,category
0,"xxbos xxmaj match 1 : xxmaj tag xxmaj team xxmaj table xxmaj match xxmaj bubba xxmaj ray and xxmaj spike xxmaj dudley vs xxmaj eddie xxmaj guerrero and xxmaj chris xxmaj benoit xxmaj bubba xxmaj ray and xxmaj spike xxmaj dudley started things off with a xxmaj tag xxmaj team xxmaj table xxmaj match against xxmaj eddie xxmaj guerrero and xxmaj chris xxmaj benoit . xxmaj according to the rules of the match , both opponents have to go through tables in order to get the win . xxmaj benoit and xxmaj guerrero heated up early on by taking turns hammering first xxmaj spike and then xxmaj bubba xxmaj ray . a xxmaj german xxunk by xxmaj benoit to xxmaj bubba took the wind out of the xxmaj dudley brother . xxmaj spike tried to help his brother , but the referee restrained him while xxmaj benoit and xxmaj guerrero",pos
1,"xxbos xxmaj by now you 've probably heard a bit about the new xxmaj disney dub of xxmaj miyazaki 's classic film , xxmaj laputa : xxmaj castle xxmaj in xxmaj the xxmaj sky . xxmaj during late summer of 1998 , xxmaj disney released "" kiki 's xxmaj delivery xxmaj service "" on video which included a preview of the xxmaj laputa dub saying it was due out in "" 1 xxrep 3 9 "" . xxmaj it 's obviously way past that year now , but the dub has been finally completed . xxmaj and it 's not "" laputa : xxmaj castle xxmaj in xxmaj the xxmaj sky "" , just "" castle xxmaj in xxmaj the xxmaj sky "" for the dub , since xxmaj laputa is not such a nice word in xxmaj spanish ( even though they use the word xxmaj laputa many times",pos
2,"xxbos xxmaj this movie was recently released on xxup dvd in the xxup us and i finally got the chance to see this hard - to - find gem . xxmaj it even came with original theatrical previews of other xxmaj italian horror classics like "" xxunk "" and "" beyond xxup the xxup darkness "" . xxmaj unfortunately , the previews were the best thing about this movie . \n\n "" zombi 3 "" in a bizarre way is actually linked to the infamous xxmaj lucio xxmaj fulci "" zombie "" franchise which began in 1979 . xxmaj similarly compared to "" zombie "" , "" zombi 3 "" consists of a threadbare plot and a handful of extremely bad actors that keeps this ' horror ' trash barely afloat . xxmaj the gore is nearly non - existent ( unless one is frightened of people running around with",neg


In [42]:
nums_samp = toks200[:10].map(num)

In [43]:
nums_samp.map(len)

(#10) [242,604,342,180,786,229,387,137,247,146]
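
The lengths above vary a lot, so these reviews cannot be stacked into one tensor as they are. A common fix is to pad the shorter sequences up to the longest one in the batch. Here is a minimal sketch, assuming the pad id is 1 (the position of `xxpad` in the vocab shown earlier) and padding at the end of each sequence. The helper `pad_batch` is hypothetical, and the library's own collation may pad on a different side or group texts of similar length first.

```python
PAD_ID = 1  # assumed: index of 'xxpad' in the vocab shown earlier

def pad_batch(seqs, pad_id=PAD_ID):
    """Right-pad every sequence to the length of the longest one."""
    max_len = max(len(s) for s in seqs)
    return [list(s) + [pad_id] * (max_len - len(s)) for s in seqs]

batch = pad_batch([[2, 8, 20], [2, 9], [2, 8, 20, 27, 16]])
print(batch)

# How many pad tokens would the ten reviews above need in a single batch?
lengths = [242, 604, 342, 180, 786, 229, 387, 137, 247, 146]
wasted = max(lengths) * len(lengths) - sum(lengths)
print(wasted, "pad tokens for", sum(lengths), "real tokens")
```

Batching texts of similar length together keeps that padding overhead small.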

In [44]:
learn = text_classifier_learner(dls_clas, AWD_LSTM, drop_mult=0.5, 
                                metrics=accuracy).to_fp16()

In [45]:
learn = learn.load_encoder('finetuned')

### Fine-Tuning the Classifier

In [46]:
learn.fit_one_cycle(1, 2e-2)

epoch,train_loss,valid_loss,accuracy,time
0,0.558728,0.50642,0.75864,08:29


In [47]:
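# Freeze everything except the last two parameter groups, then train with
# discriminative learning rates (smaller for earlier layers)
learn.freeze_to(-2)
learn.fit_one_cycle(1, slice(1e-2/(2.6**4),1e-2))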

epoch,train_loss,valid_loss,accuracy,time
0,0.365955,0.287063,0.88072,09:45
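
Passing `slice(1e-2/(2.6**4), 1e-2)` asks for discriminative learning rates: the earliest layers get the smallest rate and the last layers get the largest. The sketch below shows one way such a range can be spread across layer groups, assuming five groups and a geometric spacing. Both are assumptions made for illustration, and `spread_lrs` is a hypothetical helper, not a library function. Under those assumptions each group's rate is 2.6 times the previous one.

```python
def spread_lrs(lo, hi, n_groups):
    """Geometrically spaced learning rates from `lo` to `hi` (inclusive)."""
    if n_groups == 1:
        return [hi]
    step = (hi / lo) ** (1 / (n_groups - 1))
    return [lo * step ** i for i in range(n_groups)]

lrs = spread_lrs(1e-2 / (2.6 ** 4), 1e-2, n_groups=5)
for group, lr in enumerate(lrs):
    print(f"group {group}: lr = {lr:.6f}")
```

The intuition is that earlier layers hold more general features from pretraining and need smaller updates than the freshly initialized classifier head.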


In [48]:
learn.freeze_to(-3)
learn.fit_one_cycle(1, slice(5e-3/(2.6**4),5e-3))

epoch,train_loss,valid_loss,accuracy,time


RuntimeError: ignored

In [49]:
learn.unfreeze()
learn.fit_one_cycle(2, slice(1e-3/(2.6**4),1e-3))

epoch,train_loss,valid_loss,accuracy,time


RuntimeError: ignored

## Disinformation and Language Models

## Conclusion

## Questionnaire

1. What is "self-supervised learning"?
1. What is a "language model"?
1. Why is a language model considered self-supervised?
1. What are self-supervised models usually used for?
1. Why do we fine-tune language models?
1. What are the three steps to create a state-of-the-art text classifier?
1. How do the 50,000 unlabeled movie reviews help us create a better text classifier for the IMDb dataset?
1. What are the three steps to prepare your data for a language model?
1. What is "tokenization"? Why do we need it?
1. Name three different approaches to tokenization.
1. What is `xxbos`?
1. List four rules that fastai applies to text during tokenization.
1. Why are repeated characters replaced with a token showing the number of repetitions and the character that's repeated?
1. What is "numericalization"?
1. Why might there be words that are replaced with the "unknown word" token?
1. With a batch size of 64, the first row of the tensor representing the first batch contains the first 64 tokens for the dataset. What does the second row of that tensor contain? What does the first row of the second batch contain? (Careful—students often get this one wrong! Be sure to check your answer on the book's website.)
1. Why do we need padding for text classification? Why don't we need it for language modeling?
1. What does an embedding matrix for NLP contain? What is its shape?
1. What is "perplexity"?
1. Why do we have to pass the vocabulary of the language model to the classifier data block?
1. What is "gradual unfreezing"?
1. Why is text generation always likely to be ahead of automatic identification of machine-generated texts?

### Further Research

1. See what you can learn about language models and disinformation. What are the best language models today? Take a look at some of their outputs. Do you find them convincing? How could a bad actor best use such a model to create conflict and uncertainty?
1. Given the limitation that models are unlikely to be able to consistently recognize machine-generated texts, what other approaches may be needed to handle large-scale disinformation campaigns that leverage deep learning?