# NLP Deep Dive: RNNs

In [1]:
#hide
! [ -e /content ] && pip install -Uqq fastbook
import fastbook
fastbook.setup_book()

In [2]:
#hide
from fastbook import *
from IPython.display import display,HTML

Generally in NLP, the pretrained model is trained on a different task from the one we ultimately care about.

A *language model* is a model that has been trained to predict the next word of a text, having read the ones before it. This kind of task is called *self-supervised learning*.

Self-supervised learning is not usually used for the model that is trained directly on the final task, but for pretraining a model that is then used for transfer learning.

>Self-supervised learning: Training a model using labels that are embedded in the independent variable instead of requiring external labels.

**Why are we learning to train a language model in detail?**
- It helps to understand the foundations of the models you're using.
- The practical reason is that we can fine-tune the language model before fine-tuning the classification model, which gives better results.

The IMDb dataset consists of movie reviews. The language model was pretrained on Wikipedia articles, so before transferring to the classification task we can use all of these reviews to fine-tune it. It then gets better at predicting the next word of a movie review, which ultimately gives a better classifier. This is known as the *Universal Language Model Fine-tuning (ULMFiT)* approach.

The 3 stages of transfer learning in NLP:
1. Language model (pretrained on Wikipedia articles)
2. Language model (the previous model fine-tuned on the IMDb reviews)
3. Classification model (trained starting from the fine-tuned language model)

## Text Preprocessing

The approach for using a single categorical variable as an independent variable of a NN:
1. Make a list of all possible levels of that categorical variable (this is the vocab).
2. Replace each level with its index in the vocab.
3. Create an embedding matrix containing one row for each level, i.e. a learnable vector for each item of the vocab.
4. Use the embedding matrix as the first layer of the NN, so that an index from step 2 looks up its row in the matrix.

The same thing can be done for text, but
- first we concatenate the documents in our dataset to form a single long string,
- then we split it into words, which gives a long list of words (or tokens).

Our independent variable is the sequence of words starting with the first word in this long list and ending with the second to last. The dependent variable is the sequence starting with the second word and ending with the last, so every input word is paired with the word that follows it.
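
The offset between the two variables can be sketched in plain Python. This is only an illustration on a made-up token list, not how fastai builds its batches:

```python
# Toy token stream (made up); in practice this is the whole concatenated corpus.
tokens = ["xxbos", "this", "movie", "was", "great", "."]

# Independent variable: every token except the last one.
x = tokens[:-1]
# Dependent variable: the same stream shifted by one token.
y = tokens[1:]

# Each input token is paired with the token that follows it.
for inp, target in zip(x, y):
    print(f"{inp!r} -> {target!r}")
```

No external labels are needed: the targets come from the text itself, which is what makes the task self-supervised.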

Our *vocab* contains a mix of common words already present in the vocab of the pretrained model and new words specific to the movie reviews corpus.

Our *embedding matrix* will be built accordingly:
1) For words already present in the pretrained vocabulary: we simply copy the corresponding row from the pretrained model's embedding matrix.
2) For new words: a word that wasn't in the pretrained vocabulary has no embedding, so we initialise a new row in our embedding matrix with a random vector.
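
A minimal sketch of this merge, using plain Python lists in place of tensors. The vocabularies, the embedding size and the random initialisation are all made up for illustration; fastai does the real work inside `language_model_learner`:

```python
import random

random.seed(0)
emb_dim = 4  # hypothetical embedding size

# Hypothetical pretrained vocab with one embedding row per word.
pretrained_vocab = ["the", "movie", "wikipedia"]
pretrained_emb = [[random.gauss(0, 1) for _ in range(emb_dim)]
                  for _ in pretrained_vocab]
pretrained_idx = {w: i for i, w in enumerate(pretrained_vocab)}

# Hypothetical vocab of the new corpus: two known words, one new word.
new_vocab = ["the", "movie", "costner"]

new_emb = []
for word in new_vocab:
    if word in pretrained_idx:
        # Known word: copy the pretrained row.
        new_emb.append(list(pretrained_emb[pretrained_idx[word]]))
    else:
        # New word: start from a random vector.
        new_emb.append([random.gauss(0, 1) for _ in range(emb_dim)])

print(len(new_emb), len(new_emb[0]))
```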

Each of the steps necessary to create a language model has jargon associated with it:

- Tokenization: Convert the text into a list of words.
- Numericalization: Make a list of all the unique words (i.e. the vocab), and convert each word into a number by looking up its index in the vocab.
- Language model data loader creation: fastai provides an `LMDataLoader` class which automatically handles creating a dependent variable that is offset from the independent variable by one token (since our task is to predict the next word in the text). It also handles some important details, such as shuffling the training data in a way that the dependent and independent variables keep their structure.
- Language model creation: we need a special kind of model that can handle input lists that may be arbitrarily big or small. There are a number of ways to do this, but here we'll use RNNs (recurrent neural networks).

### Tokenization

When we said "convert the text into a list of words," we left out a lot of details. For instance, what do we do with punctuation? How do we deal with a word like "don't"? Is it one word, or two? What about long medical or chemical words? Should they be split into their separate pieces of meaning? How about hyphenated words? What about languages like German and Polish where we can create really long words from many, many pieces? What about languages like Japanese and Chinese that don't use spaces at all, and don't really have a well-defined idea of a word?

Because there is no one correct answer to these questions, there is no one approach to tokenization.

There are three main approaches:
1. Word-based tokenization: split a sentence on spaces, and apply language-specific rules to separate parts of meaning even when there is no space.
2. Subword-based tokenization: split words into smaller parts based on the most commonly occurring substrings.
3. Character-based tokenization: split a sentence into its individual characters.
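
To make the difference between word-based and character-based tokenization concrete, here is a deliberately naive sketch using only the standard library. The regular expression is an assumption for illustration; real word tokenizers such as spaCy apply many more language-specific rules:

```python
import re

sentence = "Mr. Costner hasn't dragged it out."

# Naive word-based tokenization: runs of word characters (optionally with an
# internal apostrophe) or single punctuation marks.
word_tokens = re.findall(r"\w+(?:'\w+)?|[^\w\s]", sentence)

# Character-based tokenization: every character (spaces included) is a token.
char_tokens = list(sentence)

print(word_tokens)
print(len(word_tokens), len(char_tokens))
```

The character version needs far more tokens for the same sentence, which is the trade-off that subword tokenization tries to balance.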

>token: One element of list created using tokenization process.

### Word tokenization with fastai

Word tokenization relies on the assumption that spaces provide a useful separation of the components of meaning in a sentence.
Rather than providing its own tokenizers, fastai provides a consistent interface to a range of tokenizers from external libraries.

In [3]:
from fastai.text.all import *
path=untar_data(URLs.IMDB)

To try out tokenizers, we need to get the text files. Similar to `get_image_files` for CV tasks, we have `get_text_files` for NLP tasks to get all the text files in a path.

In [4]:
files=get_text_files(path,folders=['train','test','unsup'])

The `folders` parameter restricts the search to a particular list of subfolders.

In [5]:
# grab the first file and view some part of it
files[0]

Path('/teamspace/studios/this_studio/.fastai/data/imdb/test/neg/0_2.txt')

In [6]:
files[0].open()

<_io.TextIOWrapper name='/teamspace/studios/this_studio/.fastai/data/imdb/test/neg/0_2.txt' mode='r' encoding='UTF-8'>

In [7]:
txt=files[0].open().read()[:100]

In [8]:
txt

'Once again Mr. Costner has dragged out a movie for far longer than necessary. Aside from the terrifi'

spaCy was fastai's default English word tokenizer when the book was written. Instead of using `SpacyTokenizer` directly, we'll use fastai's `WordTokenizer`, which always points to whatever the current default word tokenizer is.

fastai's `coll_repr(collection, n)` function is used to display the results. It displays the first *n* items of *collection*, along with its full size. It is what `L` (a list-like collection with added functionality) uses by default for its string representation.

**Note**: fastai's tokenizers take a collection of documents to tokenize, so we need to wrap `txt` in a list.

In [9]:
spacy=WordTokenizer()
toks=first(spacy([txt]))
print(coll_repr(toks,30))

(#19) ['Once','again','Mr.','Costner','has','dragged','out','a','movie','for','far','longer','than','necessary','.','Aside','from','the','terrifi']


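spaCy handles all the little details in the text, such as keeping "Mr." together as one token while splitting off the full stop at the end of the sentence.

fastai then adds some additional functionality to the tokenization process with the `Tokenizer` class.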

In [10]:
tkn=Tokenizer(spacy)
print(coll_repr(tkn(txt),30))

(#25) ['xxbos','xxmaj','once','again','xxmaj','mr','.','xxmaj','costner','has','dragged','out','a','movie','for','far','longer','than','necessary','.','xxmaj','aside','from','the','terrifi']


The tokens that start with the characters "xx" are called special tokens. By recognising the start token (`xxbos`), the model will be able to learn that it needs to forget what was said previously and focus on the upcoming words.

These special tokens don't come from spaCy directly. fastai adds them when processing the text, and they are designed to make it easier for the model to recognise the important parts of a sentence.

Some main special tokens:

- `xxbos`: indicates beginning of text
- `xxmaj`: indicates the next word begins with a capital letter(since everything is lowercased)
- `xxunk`: indicates word is unknown

To see the rules that were used, we can check the default rules using the line below:

In [11]:
defaults.text_proc_rules

[<function fastai.text.core.fix_html(x)>,
 <function fastai.text.core.replace_rep(t)>,
 <function fastai.text.core.replace_wrep(t)>,
 <function fastai.text.core.spec_add_spaces(t)>,
 <function fastai.text.core.rm_useless_spaces(t)>,
 <function fastai.text.core.replace_all_caps(t)>,
 <function fastai.text.core.replace_maj(t)>,
 <function fastai.text.core.lowercase(t, add_bos=True, add_eos=False)>]

Summary of what each does:
- `fix_html`: replaces special HTML characters with a readable version.
- `replace_rep`: replaces any character repeated three or more times with a special token (`xxrep`), followed by the number of times it is repeated and then the character.
- `replace_wrep`: replaces any word repeated three or more times with a special token (`xxwrep`), followed by the number of times it is repeated and then the word.
- `spec_add_spaces`: adds spaces around / and #.
- `rm_useless_spaces`: removes all repetitions of the space character.
- `replace_all_caps`: lowercases a word written in all caps and adds a special token (`xxup`) in front of it.
- `replace_maj`: lowercases a capitalised word and adds a special token (`xxmaj`) in front of it.
- `lowercase`: lowercases all text and adds a special token at the beginning (`xxbos`) and/or the end (`xxeos`).
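
To give a feel for what such rules do, here is a simplified sketch of two of them. The functions `simple_replace_rep` and `simple_replace_maj` are hypothetical stand-ins written for this note; they are not fastai's implementations and ignore many edge cases:

```python
import re

def simple_replace_rep(text):
    "Replace a character repeated 3+ times with ' xxrep <count> <char> '."
    def _sub(m):
        char, rest = m.group(1), m.group(2)
        return f" xxrep {len(rest) + 1} {char} "
    return re.sub(r"(\S)(\1{2,})", _sub, text)

def simple_replace_maj(text):
    "Lowercase capitalised words and put 'xxmaj' in front of them."
    out = []
    for word in text.split():
        # Capitalised = first letter upper case, the rest lower case.
        if word[0].isupper() and word[1:].islower():
            out += ["xxmaj", word.lower()]
        else:
            out.append(word)
    return " ".join(out)

print(simple_replace_rep("soooo good").split())
print(simple_replace_maj("Once again Mr. Costner"))
```

The point of such rules is that information like "this word was capitalised" is kept as a separate token, so the vocab needs only the lowercase form of each word.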

### Subword tokenization 

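Subword tokenization is best suited to languages where spaces do not provide a useful separation of the components of meaning: for example Chinese and Japanese, which don't use spaces between words, or languages such as Turkish and Hungarian, which can join many subwords together without spaces to form very long words.

To handle these cases, we use subword tokenization, which has 2 steps:
1. Analyse a corpus of documents to find the most *commonly occurring groups of letters*. These become the vocab.
2. Tokenise the corpus using this vocab of subword units.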
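
Step 2 can be illustrated with a toy greedy longest-match splitter. Note the assumptions: the vocab below is hand-picked rather than learned, and fastai's `SubwordTokenizer` relies on SentencePiece, which uses a different (statistical) algorithm. The function name `greedy_subword_tokenize` is hypothetical:

```python
def greedy_subword_tokenize(word, vocab):
    """Split `word` by repeatedly taking the longest vocab entry that matches
    at the current position, falling back to a single character."""
    tokens, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                tokens.append(word[i:j])
                i = j
                break
        else:
            # No vocab entry matches here: emit one character.
            tokens.append(word[i])
            i += 1
    return tokens

# Hand-picked subword vocab, for illustration only.
vocab = {"long", "er", "drag", "g", "ed"}

print(greedy_subword_tokenize("longer", vocab))
print(greedy_subword_tokenize("dragged", vocab))
```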

In [12]:
txts=L(o.open().read() for o in files[:2000])

We instantiate the tokenizer, passing in the size of the vocab we want to create, and then we need to train it, i.e. have it read our documents and find the common sequences of characters to create the vocab. This is done with `setup`, which is normally called automatically by fastai's data processing pipelines. Since we are doing everything manually here, we have to call it ourselves.

In [13]:
def subword(sz):
    sp=SubwordTokenizer(vocab_sz=sz)
    sp.setup(txts)
    return ' '.join(first(sp([txt])))

In [14]:
subword(1000)

sentencepiece_trainer.cc(178) LOG(INFO) Running command: --input=tmp/texts.out --vocab_size=1000 --model_prefix=tmp/spm --character_coverage=0.99999 --model_type=unigram --unk_id=9 --pad_id=-1 --bos_id=-1 --eos_id=-1 --minloglevel=2 --user_defined_symbols=▁xxunk,▁xxpad,▁xxbos,▁xxeos,▁xxfld,▁xxrep,▁xxwrep,▁xxup,▁xxmaj --hard_vocab_limit=false


'▁O n ce ▁again ▁M r . ▁Co st n er ▁has ▁ d ra g g ed ▁out ▁a ▁movie ▁for ▁far ▁long er ▁than ▁ ne ce s s ar y . ▁A side ▁from ▁the ▁ ter ri f i'

In [15]:
subword(200)

'▁ O n ce ▁a g a in ▁ M r . ▁ C o s t n er ▁ h a s ▁ d ra g g ed ▁ o u t ▁a ▁movie ▁for ▁f ar ▁ l on g er ▁ th an ▁ n e ce s s ar y . ▁A s i d e ▁f ro m ▁the ▁ ter ri f i'

In [16]:
subword(10000)

'▁On ce ▁again ▁Mr . ▁Costner ▁has ▁dragged ▁out ▁a ▁movie ▁for ▁far ▁longer ▁than ▁necessary . ▁A side ▁from ▁the ▁ ter ri fi'

When we use fastai's subword tokenizer, the special character `▁` represents a space character in the original text.

- If we use a *smaller vocab*, then each token will represent *fewer characters* and it will take *more tokens to represent a sentence*.
- If we use a *larger vocab*, the most common English words will end up in the vocab themselves and we *will not need as many tokens to represent a sentence*.

Picking a subword vocab size represents a compromise:

- a *larger vocab* means *fewer tokens per sentence*, which means faster training, less memory and less state for the model to remember. On the downside, it means a *larger embedding matrix*, which requires more data to learn.

Overall, subword tokenization provides a way to scale between character tokenization (i.e. using a small subword vocab) and word tokenization (i.e. using a large subword vocab), and it handles every language without needing language-specific algorithms to be developed.

Once our text is split into tokens, we need to convert them into numbers.

### Numericalization with fastai

*Numericalization* is the process of mapping tokens to integers. The steps are similar to those needed to create a `Category` variable:

1. Make a list of all possible levels of the categorical variable (this is the vocab).
2. Replace each level with its index in the vocab.

In [17]:
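toks=tkn(txt)
print(coll_repr(toks,31))

(#25) ['xxbos','xxmaj','once','again','xxmaj','mr','.','xxmaj','costner','has','dragged','out','a','movie','for','far','longer','than','necessary','.','xxmaj','aside','from','the','terrifi']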


Just like with `SubwordTokenizer`, we need to call `setup` on `Numericalize` in order to create the vocab.
That means we first need our tokenized corpus.

In [18]:
toks200=txts[:200].map(tkn)
toks200[0]

(#207) ['xxbos','xxmaj','once','again','xxmaj','mr','.','xxmaj','costner','has'...]

In [19]:
# we then pass this to setup
num=Numericalize()
num.setup(toks200)
coll_repr(num.vocab,20)

"(#1968) ['xxunk','xxpad','xxbos','xxeos','xxfld','xxrep','xxwrep','xxup','xxmaj','the','.',',','a','and','of','to','is','it','i','in'...]"

Our special rules tokens appear first, and then every word appears once, in frequency order. The defaults of `Numericalize` are `min_freq=3, max_vocab=60000`. `max_vocab=60000` results in fastai replacing all words other than the 60,000 most common ones with the special *unknown word* token, `xxunk`. This is useful to avoid having an overly large embedding matrix, which can slow down training. `min_freq=3` means that any word appearing fewer than three times is replaced by `xxunk`.
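
The effect of `min_freq` and `max_vocab` can be sketched with a small pure-Python version. `build_vocab` and `numericalize` are hypothetical helpers written for this note, and the list of special tokens is shortened; this is not fastai's `Numericalize` code:

```python
from collections import Counter

SPECIALS = ["xxunk", "xxpad", "xxbos"]  # shortened list of special tokens

def build_vocab(token_lists, min_freq=3, max_vocab=60000):
    "Special tokens first, then words in frequency order, subject to the limits."
    counts = Counter(tok for toks in token_lists for tok in toks)
    words = [w for w, c in counts.most_common(max_vocab)
             if c >= min_freq and w not in SPECIALS]
    return SPECIALS + words

def numericalize(tokens, vocab):
    "Map each token to its vocab index; unknown tokens map to 'xxunk'."
    idx = {w: i for i, w in enumerate(vocab)}
    return [idx.get(t, idx["xxunk"]) for t in tokens]

docs = [["xxbos", "the", "movie", "the", "end"],
        ["xxbos", "the", "movie", "movie"]]
vocab = build_vocab(docs, min_freq=2)

print(vocab)
print(numericalize(["xxbos", "the", "end"], vocab))
```

Here "end" appears only once, so with `min_freq=2` it is left out of the vocab and is numericalized as the index of `xxunk`.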

fastai can also numericalize our dataset using a vocab that we provide, by passing a list of words as the `vocab` parameter.

Once we have created `Numericalize` object, we can use it as a function:

In [20]:
nums=num(toks)[:20]
nums

TensorText([   2,    8,  349,  183,    8, 1177,   10,    8, 1178,   60, 1455,   62,   12,   25,   28,  189,  957,   93,  958,   10])

This time our tokens have been converted to a tensor of integers that our model can receive. We can check that they map back to the original text as follows.

In [21]:
' '.join(num.vocab[o] for o in nums)

'xxbos xxmaj once again xxmaj mr . xxmaj costner has dragged out a movie for far longer than necessary .'

Now that we have numbers, we need to put them into batches for our model.

### Putting our texts into batches for a language model

Our language model should read text in order, so each new batch should begin precisely where the previous one left off. This is unlike images, where the main concern was resizing every image to the same height and width before grouping them into a mini-batch, so they could be stacked efficiently in a single tensor.

Suppose we have the following text:

> : In this chapter, we will go back over the example of classifying movie reviews we studied in chapter 1 and dig deeper under the surface. First we will look at the processing steps necessary to convert text into numbers and how to customize it. By doing this, we'll have another example of the PreProcessor used in the data block API.\nThen we will study how we build a language model and train it for a while.

The tokenization process will add special tokens and deal with punctuation to return this text:

> : xxbos xxmaj in this chapter , we will go back over the example of classifying movie reviews we studied in chapter 1 and dig deeper under the surface . xxmaj first we will look at the processing steps necessary to convert text into numbers and how to customize it . xxmaj by doing this , we 'll have another example of the preprocessor used in the data block xxup api . \n xxmaj then we will study how we build a language model and train it for a while .

We now have 90 tokens, separated by spaces. Let's say we want a batch size of 6. We need to break this text into 6 contiguous parts of length 15:

In [22]:
stream = "In this chapter, we will go back over the example of classifying movie reviews we studied in chapter 1 and dig deeper under the surface. First we will look at the processing steps necessary to convert text into numbers and how to customize it. By doing this, we'll have another example of the PreProcessor used in the data block API.\nThen we will study how we build a language model and train it for a while."
tokens=tkn(stream)
bs,seq_len=6,15
d_tokens=np.array([tokens[i*seq_len:(i+1)*seq_len] for i in range(bs)])
df=pd.DataFrame(d_tokens)
display(HTML(df.to_html(index=False,header=None)))

0,1,2,3,4,5,6,7,8,9,10,11,12,13,14
xxbos,xxmaj,in,this,chapter,",",we,will,go,back,over,the,example,of,classifying
movie,reviews,we,studied,in,chapter,1,and,dig,deeper,under,the,surface,.,xxmaj
first,we,will,look,at,the,processing,steps,necessary,to,convert,text,into,numbers,and
how,to,customize,it,.,xxmaj,by,doing,this,",",we,'ll,have,another,example
of,the,preprocessor,used,in,the,data,block,xxup,api,.,\n,xxmaj,then,we
will,study,how,we,build,a,language,model,and,train,it,for,a,while,.


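A single batch holding the whole stream would not fit in memory for a real corpus, so we need to divide this array more finely into subarrays of a fixed sequence length. It is important to maintain order within and across these subarrays, because we will use a model that maintains a state, so that it remembers what it read previously when predicting what comes next.

The next three cells show the result for a sequence length of 5: each row of 15 tokens is cut into three chunks of 5, giving three mini-batches of shape 6×5. Row *i* of the second mini-batch continues exactly where row *i* of the first left off, and likewise for the third.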

In [23]:
bs,seq_len=6,5
d_tokens=np.array([tokens[i*15:i*15+seq_len] for i in range(bs)])
df = pd.DataFrame(d_tokens)
display(HTML(df.to_html(index=False,header=None)))

0,1,2,3,4
xxbos,xxmaj,in,this,chapter
movie,reviews,we,studied,in
first,we,will,look,at
how,to,customize,it,.
of,the,preprocessor,used,in
will,study,how,we,build


In [24]:
# second mini-batch: tokens 5-9 of each row
bs,seq_len = 6,5
d_tokens = np.array([tokens[i*15+seq_len:i*15+2*seq_len] for i in range(bs)])
df = pd.DataFrame(d_tokens)
display(HTML(df.to_html(index=False,header=None)))

0,1,2,3,4
",",we,will,go,back
chapter,1,and,dig,deeper
the,processing,steps,necessary,to
xxmaj,by,doing,this,","
the,data,block,xxup,api
a,language,model,and,train


In [25]:
# third mini-batch: tokens 10-14 of each row
bs,seq_len = 6,5
d_tokens = np.array([tokens[i*15+10:i*15+15] for i in range(bs)])
df = pd.DataFrame(d_tokens)
display(HTML(df.to_html(index=False,header=None)))

0,1,2,3,4
over,the,example,of,classifying
under,the,surface,.,xxmaj
convert,text,into,numbers,and
we,'ll,have,another,example
.,\n,xxmaj,then,we
it,for,a,while,.


The first step is to transform the individual texts into a stream by concatenating them together. At the beginning of each epoch we shuffle the entries to make a new stream (we shuffle the order of the documents, not the order of the words inside them).

We then cut the stream into a certain number of mini-streams (this number is the batch size). For example, if the stream has 50,000 tokens and we set a batch size of 10, this gives us 10 mini-streams of 5,000 tokens. What is important is that we preserve the order of the tokens (i.e. the first mini-stream holds tokens 1-5,000, the second 5,001-10,000, and so on). An `xxbos` token is added at the start of each text, so that the model knows when it reads the stream that a new entry is beginning.
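
A small sketch of this cutting, with integers standing in for tokens. The helpers `make_mini_streams` and `batches` are hypothetical names written for this note, not `LMDataLoader`'s actual code:

```python
def make_mini_streams(stream, bs):
    "Cut `stream` into `bs` contiguous mini-streams of equal length (remainder dropped)."
    n = len(stream) // bs
    return [stream[i * n:(i + 1) * n] for i in range(bs)]

def batches(mini_streams, seq_len):
    "Yield batches; each one takes the next `seq_len` tokens of every mini-stream."
    n = len(mini_streams[0])
    for start in range(0, n, seq_len):
        yield [ms[start:start + seq_len] for ms in mini_streams]

stream = list(range(1, 31))              # 30 "tokens"
mini = make_mini_streams(stream, bs=3)   # 3 mini-streams of 10 tokens
all_batches = list(batches(mini, seq_len=5))

for b in all_batches:
    print(b)
```

Row 0 of the second batch starts at token 6, right after row 0 of the first batch ended at token 5. That continuity is what lets a stateful model carry context from one batch to the next.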

So to recap:

- First we shuffle the collection of docs (such as articles, texts, reviews, etc.) at each epoch.
- Concatenate them into one long stream.
- Cut the stream into mini-streams.
- Group the mini-streams into batches.
- Then the model reads the mini-streams in order. It maintains an internal state, allowing it to carry context from one batch to the next within each mini-stream.

This is all done behind the scenes by the fastai library when we create an `LMDataLoader`. To do this, we first apply our `Numericalize` object to the tokenised texts.

In [26]:
nums200=toks200.map(num)

In [27]:
# then pass to LMDataLoader
dl=LMDataLoader(nums200)

In [28]:
# let's confirm it gives expected length by grabbing the first batch
x,y=first(dl)
x.shape,y.shape

(torch.Size([64, 72]), torch.Size([64, 72]))

Then we look at the independent variable, which should be the start of the first text.

In [29]:
' '.join(num.vocab[o] for o in x[0][:20])

'xxbos xxmaj once again xxmaj mr . xxmaj costner has dragged out a movie for far longer than necessary .'

The dependent variable is the same thing offset by one token.

In [30]:
' '.join(num.vocab[o] for o in y[0][:20])

'xxmaj once again xxmaj mr . xxmaj costner has dragged out a movie for far longer than necessary . xxmaj'

This concludes all the preprocessing steps we need to apply to our data, and we are now ready to train our text classifier.

## Training a text classifier

There are 2 steps to training a state-of-the-art text classifier using transfer learning:

1. Fine-tune the language model pretrained on Wikipedia to the corpus of IMDb reviews.
2. Use that model to train a text classifier.

Let's start by assembling our data.

### Language model using DataBlock

fastai handles tokenization and numericalization automatically when a `TextBlock` is passed to `DataBlock`. All the arguments that can be passed to `Tokenizer` and `Numericalize` can also be passed to `TextBlock`.

To debug, we can run them manually on a subset of the data as we have done before. `DataBlock`'s `summary` method is also very useful for debugging data issues.

Here's how we use `TextBlock` to create a language model using fastai's defaults:

```
get_imdb=partial(get_text_files,folders=['train','test','unsup'])
```
`partial` creates a new function with some of the arguments of the original function already filled in.

This line of code creates a new function, `get_imdb`, which always calls `get_text_files` with the `folders` argument set to `['train', 'test', 'unsup']`.
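
The same idea on a toy function, using only the standard library. `list_files` is a hypothetical stand-in that just returns a description of what it would search for:

```python
from functools import partial

def list_files(root, folders=None, ext=".txt"):
    "Hypothetical stand-in for a file-listing function."
    return f"search {root} in {folders} for *{ext}"

# Pre-fill the `folders` argument; `root` is still supplied at call time.
list_imdb = partial(list_files, folders=["train", "test", "unsup"])

print(list_imdb("data/imdb"))
print(list_imdb.keywords)
```

Calling `list_imdb("data/imdb")` is equivalent to calling `list_files("data/imdb", folders=["train", "test", "unsup"])`.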

In [31]:
get_imdb=partial(get_text_files,folders=['train','test','unsup'])
dls_lm=DataBlock(
    blocks=TextBlock.from_folder(path,is_lm=True),
    get_items=get_imdb,splitter=RandomSplitter(0.1)
).dataloaders(path,path=path,bs=128,seq_len=80)

In this `DataBlock`, we are not using the `TextBlock` class directly (i.e. `TextBlock(...)`) but one of its class methods. We need to tell `TextBlock` how to access the texts so that it can do the initial preprocessing, and that's what `from_folder` does.

In [32]:
dls_lm.show_batch(max_n=2)

Unnamed: 0,text,text_
0,"xxbos xxmaj africa xxmaj screams , one of the least seen of abbott&costello 's films was an independent production that was released through xxmaj united xxmaj artists . xxmaj the thin plot has xxmaj hillary xxmaj brooke believing xxmaj costello has the map to a hidden territory that is rich with diamonds . xxmaj bud and xxmaj lou go to xxmaj africa at her behest with her two companions , the fighting xxmaj baer xxmaj brothers . xxmaj of course","xxmaj africa xxmaj screams , one of the least seen of abbott&costello 's films was an independent production that was released through xxmaj united xxmaj artists . xxmaj the thin plot has xxmaj hillary xxmaj brooke believing xxmaj costello has the map to a hidden territory that is rich with diamonds . xxmaj bud and xxmaj lou go to xxmaj africa at her behest with her two companions , the fighting xxmaj baer xxmaj brothers . xxmaj of course the"
1,"xxmaj xxunk and they got the balls to make the xxmaj christians out to be the intolerant , xenophobic and reactionary half - wits . \n\n xxmaj moral xxmaj orel is still an interesting watch ( as long as it comes between superior shows on xxmaj adult xxmaj swim ) because it is a satire . xxmaj however , xxmaj it is more a satire on the people that make it rather then the people it is depicting . \n\n","xxunk and they got the balls to make the xxmaj christians out to be the intolerant , xenophobic and reactionary half - wits . \n\n xxmaj moral xxmaj orel is still an interesting watch ( as long as it comes between superior shows on xxmaj adult xxmaj swim ) because it is a satire . xxmaj however , xxmaj it is more a satire on the people that make it rather then the people it is depicting . \n\n xxmaj"


Now that our data is ready, we can fine-tune the pretrained language model.

### Fine-tuning Language model

- To convert the integer word indices into activations that we can use for our NN, we will use embeddings.
- Then we'll feed those embeddings into an RNN, using an architecture called *AWD-LSTM*.

The embeddings of the pretrained model are merged with random embeddings added for the words that were not in the pretrained vocab. This is handled automatically inside `language_model_learner`.

In [33]:
learn=language_model_learner(
    dls_lm,AWD_LSTM, drop_mult=0.3,
    metrics=[accuracy,Perplexity()]).to_fp16()

The loss function used is cross-entropy, since predicting the next token is essentially a classification problem (the categories being the tokens of our vocab).

Perplexity is a metric often used for language models. It is the exponential of the loss, i.e. `torch.exp(cross_entropy)`.

We'll use `fit_one_cycle` and save intermediate results during model training. Just like `vision_learner`, `language_model_learner` automatically calls `freeze` when using a pretrained model, so this will only train the embeddings, the only part of the model that contains randomly initialised weights, i.e. the embeddings for words that are in our IMDb vocab but aren't in the pretrained model's vocab.

By choosing `fit_one_cycle` over `fine_tune`, you're opting for more manual control over the training process. This is often necessary when working with large models or datasets, where each epoch takes a long time and you want to be able to resume training from checkpoints.
Remember that with `fit_one_cycle` you need to handle steps such as unfreezing layers yourself, whereas `fine_tune` handles some of these steps for you.

In [34]:
learn.fit_one_cycle(1,2e-2)

epoch,train_loss,valid_loss,accuracy,perplexity,time
0,4.004734,3.904704,0.300501,49.63538,20:57
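
The relationship between the loss and perplexity columns can be checked by hand with the standard library. The second part uses a made-up 4-token vocab to show that a model guessing uniformly has a perplexity equal to the vocab size:

```python
import math

# Validation loss reported for the first epoch above.
valid_loss = 3.904704
perplexity = math.exp(valid_loss)
print(round(perplexity, 3))

# Toy case (made up): uniform guess over 4 tokens.
# Cross-entropy is -log(1/4), so perplexity is 4.
uniform_loss = -math.log(1 / 4)
print(math.exp(uniform_loss))
```

A perplexity of about 50 can be read loosely as the model being as uncertain as if it were choosing uniformly among about 50 tokens at each step.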


### Saving and loading models

We can easily save the state of our model:

In [35]:
learn.save('1epoch')

Path('/teamspace/studios/this_studio/.fastai/data/imdb/models/1epoch.pth')

This will create a file in `learn.path/models/` named *1epoch.pth*. To load the model on another machine after creating the `Learner` in the same way, or to resume training later, we can load the contents of this file as follows.

In [36]:
learn=learn.load('1epoch')

Once the initial training has been completed, we continue fine-tuning the model after unfreezing.

`unfreeze()` is used to make all the layers of the NN trainable.

In the context of TL,

- we often start with a pre-trained model
- initially, we freeze the early layers and train only the last few layers on our new data.

What does it do?

- it sets `requires_grad` to `True` for all parameters of the model, so during backpropagation gradients are computed for all layers, allowing them to be updated during training.

When to use `unfreeze()`:

- After initial training with frozen layers, when you want to fine-tune the entire model.
- When you have enough data and computational resources to train the entire network.
- When you want to allow the model to adapt more deeply to your specific task or dataset.

Typical Workflow:

- Start with a pre-trained model with most layers frozen.
- Train only the last few layers for some epochs.
- Call learn.unfreeze() to unfreeze all layers.
- Continue training with a lower learning rate.

In [37]:
learn.unfreeze()
learn.fit_one_cycle(10,2e-3)

epoch,train_loss,valid_loss,accuracy,perplexity,time
0,3.774635,3.761657,0.31704,43.019672,21:39
1,3.712948,3.701956,0.323748,40.526478,21:18
2,3.622169,3.654953,0.329183,38.665722,21:09
3,3.570991,3.620903,0.333016,37.371307,21:28
4,3.512937,3.600293,0.335536,36.608948,21:28
5,3.410577,3.586111,0.337969,36.09343,21:14
6,3.365716,3.577701,0.33949,35.791176,21:41
7,3.294542,3.575607,0.340427,35.716278,21:36
8,3.241257,3.578088,0.340611,35.805027,21:28
9,3.224428,3.582369,0.34043,35.95863,21:28


Once this is completed, we save all of our model except the final layer, which converts activations to probabilities of picking each token in our vocabulary.

The model without the final layer is called the encoder, and we save it using `save_encoder`.

In [38]:
learn.save_encoder('fine_tuned')

>encoder : The model not including the task specific final layer(s). This term means the same as body in CNNs but encoder is more used in NLP.


This completes the second stage, fine-tuning the language model. We can now use it to fine-tune a classifier using the IMDb sentiment labels.

### Text generation

Before we go to fine-tuning classifier, we will use the model to generate random reviews.

In [39]:
TEXT='I liked this movie because'
N_WORDS=40
N_SENTENCES=2
preds=[learn.predict(TEXT,N_WORDS,temperature=0.75)
       for _ in range(N_SENTENCES)]

In [40]:
print("\n".join(preds))

i liked this movie because it was a " b " movie . If you like some of the worst movies you 'll ever see , you 'll like this one . If you 're looking for a good horror flick , watch
i liked this movie because it actually turned out to be a real good movie . i thought it was predictable and you could see it coming a mile away . If you can plan on watching the movie you have to please make


### Creating a classifier DataLoaders

Recap: a language model predicts the next word of a document and doesn't require external labels, whereas our classifier requires an external label, in this case the sentiment of the document.

This means the structure of the `DataBlock` for NLP classification will look very familiar. It is nearly the same as for the image classification datasets.

In [41]:
dls_clas=DataBlock(
    blocks=(TextBlock.from_folder(path,vocab=dls_lm.vocab),CategoryBlock),
    get_y=parent_label,
    get_items=partial(get_text_files,folders=['train','test']),
    splitter=GrandparentSplitter(valid_name='test')
).dataloaders(path,path=path,bs=128,seq_len=72)

The first `path` is a positional argument of `dataloaders` and specifies the root directory where the data is located.

The second, `path=path`, is a keyword argument. It sets the `path` attribute of the resulting `DataLoaders`, which the `Learner` later uses as its base directory, for example when saving models under `learn.path/models/`.

Why do we use `GrandparentSplitter`?

Because `GrandparentSplitter` looks at the name of the grandparent folder of each file to determine the split. The structure of the IMDb dataset is as follows:
```
path/
  train/
    pos/
      review1.txt
      review2.txt
      ...
    neg/
      review1.txt
      review2.txt
      ...
  test/
    pos/
      review1.txt
      review2.txt
      ...
    neg/
      review1.txt
      review2.txt
      ...
```
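
The logic can be sketched with `pathlib` on a few made-up file paths. `grandparent_split` and `parent_name` are hypothetical helpers that mimic what `GrandparentSplitter` and `parent_label` do conceptually; they are not fastai's code:

```python
from pathlib import PurePosixPath

# Made-up paths following the folder structure above.
files = [
    PurePosixPath("imdb/train/pos/review1.txt"),
    PurePosixPath("imdb/train/neg/review1.txt"),
    PurePosixPath("imdb/test/pos/review1.txt"),
    PurePosixPath("imdb/test/neg/review2.txt"),
]

def grandparent_split(files, train_name="train", valid_name="test"):
    "Return (train indices, valid indices) based on the grandparent folder name."
    train = [i for i, f in enumerate(files) if f.parent.parent.name == train_name]
    valid = [i for i, f in enumerate(files) if f.parent.parent.name == valid_name]
    return train, valid

def parent_name(f):
    "The label is the name of the folder the file sits in."
    return f.parent.name

train_idx, valid_idx = grandparent_split(files)
print(train_idx, valid_idx)
print([parent_name(f) for f in files])
```

So the grandparent folder (`train` or `test`) decides the split, while the parent folder (`pos` or `neg`) gives the label.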

In [42]:
dls_clas.show_batch(max_n=3)

Unnamed: 0,text,category
0,"xxbos xxmaj match 1 : xxmaj tag xxmaj team xxmaj table xxmaj match xxmaj bubba xxmaj ray and xxmaj spike xxmaj dudley vs xxmaj eddie xxmaj guerrero and xxmaj chris xxmaj benoit xxmaj bubba xxmaj ray and xxmaj spike xxmaj dudley started things off with a xxmaj tag xxmaj team xxmaj table xxmaj match against xxmaj eddie xxmaj guerrero and xxmaj chris xxmaj benoit . xxmaj according to the rules of the match , both opponents have to go through tables in order to get the win . xxmaj benoit and xxmaj guerrero heated up early on by taking turns hammering first xxmaj spike and then xxmaj bubba xxmaj ray . a xxmaj german xxunk by xxmaj benoit to xxmaj bubba took the wind out of the xxmaj dudley brother . xxmaj spike tried to help his brother , but the referee restrained him while xxmaj benoit and xxmaj guerrero",pos
1,"xxbos xxmaj some have praised _ xxunk _ as a xxmaj disney adventure for adults . i do n't think so -- at least not for thinking adults . \n\n xxmaj this script suggests a beginning as a live - action movie , that struck someone as the type of crap you can not sell to adults anymore . xxmaj the "" crack staff "" of many older adventure movies has been done well before , ( think _ the xxmaj dirty xxmaj dozen _ ) but _ atlantis _ represents one of the worse films in that motif . xxmaj the characters are weak . xxmaj even the background that each member trots out seems stock and awkward at best . xxmaj an xxup md / xxmaj medicine xxmaj man , a tomboy mechanic whose father always wanted sons , if we have not at least seen these before",neg
2,"xxbos xxmaj some have praised xxunk xxmaj lost xxmaj xxunk as a xxmaj disney adventure for adults . i do n't think so -- at least not for thinking adults . \n\n xxmaj this script suggests a beginning as a live - action movie , that struck someone as the type of crap you can not sell to adults anymore . xxmaj the "" crack staff "" of many older adventure movies has been done well before , ( think xxmaj the xxmaj dirty xxmaj dozen ) but xxunk represents one of the worse films in that motif . xxmaj the characters are weak . xxmaj even the background that each member trots out seems stock and awkward at best . xxmaj an xxup md / xxmaj medicine xxmaj man , a tomboy mechanic whose father always wanted sons , if we have not at least seen these before ,",neg


The `DataBlock` definition is similar to the previous one, except for two changes:

1. `TextBlock.from_folder` no longer has `is_lm=True`
2. We pass the `vocab` we created for the language model fine-tuning.

We pass the vocab of the language model to make sure we use the same correspondence of token to index. Otherwise the embeddings learned in the fine-tuned language model won't make any sense to this model, and the fine-tuning step won't be of any use.

And by not passing `is_lm=True`, we tell `TextBlock` that we have regular labeled data rather than using next tokens as labels.
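The point about sharing the vocab can be illustrated with a minimal sketch in plain Python. The two vocabularies and the `numericalize` helper below are hypothetical and are not fastai's implementation. They only show that the same tokens map to different indices when the vocab order differs, so each index would look up the wrong embedding row.

```python
# Hypothetical vocabularies: the same tokens, stored in a different order
lm_vocab = ['xxunk', 'xxpad', 'xxbos', 'the', 'movie', 'great']
new_vocab = ['xxunk', 'xxpad', 'xxbos', 'great', 'the', 'movie']

def numericalize(tokens, vocab):
    """Map each token to its index in vocab; unknown tokens map to index 0."""
    index = {tok: i for i, tok in enumerate(vocab)}
    return [index.get(tok, 0) for tok in tokens]

tokens = ['xxbos', 'the', 'movie', 'great']
ids_lm = numericalize(tokens, lm_vocab)    # indices the fine-tuned embeddings expect
ids_new = numericalize(tokens, new_vocab)  # indices from a freshly built vocab

print(ids_lm)
print(ids_new)
```

With the language model's vocab, each index points at the embedding row that was trained for that token. With a rebuilt vocab, the indices differ and the pretrained rows no longer line up with the tokens.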

However, one challenge is collating multiple docs into a mini-batch. As an example, let's create a mini-batch containing the first 10 docs. First we need to numericalize them.

In [43]:
nums_samp=toks200[:10].map(num)

Now let's see how many tokens each of these 10 reviews has:

In [44]:
nums_samp.map(len)

(#10) [207,314,267,378,304,156,197,187,162,229]

PyTorch `DataLoader`s need to collate all items in a batch into a single tensor, and a single tensor has a fixed shape.

We can't crop the docs because we would lose some information. We also can't squish a doc, and there is no data augmentation at present for NLP. So that leaves padding.

We expand the shorter texts to make them all the same size. To do this, we use a special padding token that is ignored by our model. Additionally, to avoid memory issues and improve performance, we batch together texts that are roughly the same length (with some shuffling for the training set). We do this by sorting the docs by length prior to each epoch.

The result is that docs collated into a single batch tend to be of similar lengths. Also, we don't pad every batch to the same size, but instead use the size of the largest doc in each batch as the target size.

Sorting and padding are done automatically by the data block API when we use `TextBlock` with `is_lm=False`.
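The sorting and padding idea described above can be sketched in plain Python. This is only an illustration and not fastai's implementation: the helper names `pad_batch` and `sorted_batches` are hypothetical, the padding index `1` is an assumption, padding is appended at the end for simplicity, and the shuffling used for the training set is left out.

```python
def pad_batch(docs, pad_idx=1):
    """Pad every doc in one batch to the length of the longest doc in that batch."""
    max_len = max(len(d) for d in docs)
    return [d + [pad_idx] * (max_len - len(d)) for d in docs]

def sorted_batches(docs, bs, pad_idx=1):
    """Sort docs by length, split them into batches of size bs, and pad each batch."""
    docs_sorted = sorted(docs, key=len)
    return [pad_batch(docs_sorted[i:i + bs], pad_idx)
            for i in range(0, len(docs_sorted), bs)]

# Toy numericalized docs of different lengths (made-up token ids)
docs = [[5, 8, 9], [7], [4, 6], [3, 2, 9, 11, 12], [10, 13, 14, 15]]
batches = sorted_batches(docs, bs=2)

for b in batches:
    print(b)
```

Because the docs are sorted first, each batch only needs a small amount of padding, and the target length differs from batch to batch.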

We now create a model to classify our texts.

In [45]:
learn=text_classifier_learner(dls_clas, AWD_LSTM, drop_mult=0.5,
                              metrics=accuracy).to_fp16()

The final step prior to training the classifier is to load the encoder from the fine-tuned language model. We use `load_encoder` instead of `load` because we only have pretrained weights available for the encoder.

In [46]:
learn=learn.load_encoder('fine_tuned')

### Fine-Tuning classifier

The last step is to train with discriminative learning rates and gradual unfreezing. In CV, we often unfreeze the whole model at once, but for NLP classifiers, we find that unfreezing a few layers at a time makes a real difference.

In [47]:
learn.fit_one_cycle(1,2e-2)

epoch,train_loss,valid_loss,accuracy,time
0,0.245921,0.180068,0.93132,00:45


"Freezing" a layer means we're telling the model not to update the weights of that layer during training.

We can pass `-2` to `freeze_to` to freeze everything except the last two parameter groups, so only those two groups are trained.

Negative indices count from the end, just like with Python lists.
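The negative indexing can be sketched with an ordinary Python list. The group names below are hypothetical placeholders for the model's parameter groups, and `trainable_after_freeze_to` is an illustrative helper, not a fastai function.

```python
# Hypothetical parameter groups, ordered from input to output
groups = ['embedding', 'lstm_1', 'lstm_2', 'lstm_3', 'head']

def trainable_after_freeze_to(groups, n):
    """Return the groups left trainable when every group before index n is frozen."""
    return groups[n:]

last_two = trainable_after_freeze_to(groups, -2)    # like freeze_to(-2)
last_three = trainable_after_freeze_to(groups, -3)  # like freeze_to(-3)
everything = trainable_after_freeze_to(groups, 0)   # nothing frozen

print(last_two)
print(last_three)
print(everything)
```

Each step of gradual unfreezing moves the cut one group closer to the input, until nothing is frozen.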

In [48]:
learn.freeze_to(-2)
learn.fit_one_cycle(1,slice(1e-2/(2.6**4),1e-2))

epoch,train_loss,valid_loss,accuracy,time
0,0.221303,0.164449,0.93812,00:51


Then unfreeze a bit more and continue training:

In [49]:
learn.freeze_to(-3)
learn.fit_one_cycle(1,slice(5e-3/(2.6**4),5e-3))

epoch,train_loss,valid_loss,accuracy,time
0,0.190569,0.148871,0.94492,01:07


Finally unfreeze the whole model

In [50]:
learn.unfreeze()
learn.fit_one_cycle(2,slice(1e-3/(2.6**4),1e-3))

epoch,train_loss,valid_loss,accuracy,time
0,0.161605,0.145473,0.94608,01:25
1,0.146006,0.146676,0.94724,01:25


In [53]:
learn.export('text-classifier-model-sentiment-movie-reviews')

Below is a breakdown of **discriminative learning rates**, followed by an explanation of the `slice` argument used in the training calls above.

Discriminative Learning Rates:

1. Basic Concept:
   - Instead of using a single learning rate for all layers of the network, we use different learning rates for different parts of the model.

2. Rationale:
   - In transfer learning, earlier layers often capture more general features, while later layers are more task-specific.
   - We want to change the earlier layers less (lower learning rate) and the later layers more (higher learning rate).

3. Implementation:
   - Usually, we set lower learning rates for earlier layers and higher rates for later layers.
   - This allows fine-tuning without destroying the useful features learned during pre-training.

Now, let's look at the code:

```python
learn.fit_one_cycle(1, slice(5e-3/(2.6**4), 5e-3))
```

Breaking it down:

1. `learn.fit_one_cycle(1, ...)`:
   - This runs the training for 1 epoch using the "one cycle" policy.
   - One cycle policy involves varying the learning rate and momentum during training for better performance.

2. `slice(5e-3/(2.6**4), 5e-3)`:
   - This creates a range of learning rates.
   - The lower bound is `5e-3/(2.6**4)` ≈ 0.000109
   - The upper bound is `5e-3` = 0.005

3. Discriminative Learning Rates in Action:
   - The `slice` object tells fastai to use a range of learning rates.
   - The earliest parameter group is assigned the lower learning rate (0.000109). Groups that are still frozen are not updated, whatever rate they are assigned.
   - The last parameter group uses the higher learning rate (0.005).
   - Groups in between use learning rates spaced multiplicatively (geometrically) between these values.

4. The Specific Values:
   - `5e-3` (0.005) is chosen as a reasonable maximum learning rate.
   - Dividing by `2.6**4` for the minimum creates a spread where, with five parameter groups, each group's learning rate is about 2.6 times larger than the previous one.

5. Why These Values:
   - The factor of 2.6 and the use of 5e-3 are often empirically determined to work well for many transfer learning tasks.
   - The exact values might be adjusted based on specific model performance.

In summary, this code sets up a training run with discriminative learning rates, allowing different parts of the model to learn at different speeds. This approach often leads to more effective fine-tuning in transfer learning scenarios.
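To make the spread concrete, here is a minimal sketch that computes multiplicatively spaced learning rates between the two bounds of the slice. The helper `geometric_lrs` is hypothetical and only mirrors the idea, and the count of five parameter groups is an assumption about this model rather than something read from the learner.

```python
def geometric_lrs(start, stop, n_groups):
    """Return n_groups learning rates spaced by a constant factor from start to stop."""
    if n_groups == 1:
        return [stop]
    step = (stop / start) ** (1 / (n_groups - 1))
    return [start * step ** i for i in range(n_groups)]

# Assumption: five parameter groups, so 2.6**4 spans four steps of about 2.6x each
lrs = geometric_lrs(5e-3 / 2.6**4, 5e-3, n_groups=5)

for i, lr in enumerate(lrs):
    print(f"group {i}: {lr:.6f}")
```

Under that assumption, the exponent `4` in `2.6**4` is simply the number of steps between the first and the last group.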

## Conclusion

We saw two models here

1. language model
2. classifier

To build a state-of-the-art classifier,

- we used a pretrained language model
- fine-tuned it on the corpus of our task
- then used its body (encoder) with a new head to do the classification.