
In [5]:
#hide
! [ -e /content ] && pip install -Uqq fastbook
import fastbook
fastbook.setup_book()

In [6]:
#hide
from fastbook import *
from IPython.display import display,HTML

## NLP Deep Dive

We can use language models for many NLP tasks, such as autocomplete and grammar checking. To do this effectively, however, we need to fine-tune the given model on the target domain. For example, it makes sense to train on a Wikipedia corpus for a domain that uses Wikipedia data, so that the model learns that domain's vocabulary and semantics.

- Universal Language Model Fine-tuning (ULMFiT) approach: first train a language model on a large body of text, then fine-tune it on the target corpus.


Each of the steps necessary to create a language model has jargon associated with it from the world of natural language processing, and fastai and PyTorch classes available to help. The steps are:

- Tokenization:: Convert the text into a list of words (or characters, or substrings, depending on the granularity of your model)

    - Word-based: Split a sentence on spaces and apply language-specific rules, so that punctuation marks become individual tokens
    - Subword-based: Split words into smaller parts
    - Character-based: Split a sentence into its individual characters

- Numericalization:: Make a list of all of the unique words that appear (the vocab), and convert each word into a number, by looking up its index in the vocab
- Language model data loader creation:: fastai provides an LMDataLoader class which automatically handles creating a dependent variable that is offset from the independent variable by one token. It also handles some important details, such as how to shuffle the training data in such a way that the dependent and independent variables maintain their structure as required
- Language model creation:: We need a special kind of model that does something we haven't seen before: handles input lists which could be arbitrarily big or small.
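
The "offset by one token" idea from the data loader step can be sketched in plain Python. This is only a toy illustration on a hand-written token list, not how `LMDataLoader` is implemented; the token list is made up for the example.

```python
# Toy illustration: a language model's target is its input shifted by one token.
tokens = ["xxbos", "i", "am", "anthony", "park", "."]

# Independent variable: every token except the last.
x = tokens[:-1]
# Dependent variable: every token except the first,
# i.e. the "next token" for each position of x.
y = tokens[1:]

for inp, target in zip(x, y):
    print(f"{inp!r} -> {target!r}")
```

Each input position is paired with the token that follows it, which is exactly what the model is asked to predict.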




In [7]:
# Tokenization, using the IMDb dataset

from fastai.text.all import *
path = untar_data(URLs.IMDB)


In [8]:

files = get_text_files(path, folders = ['train', 'test', 'unsup'])

In [9]:
# part that will be tokenized

txt = files[0].open().read();
txt[:75]

'I am Anthony Park, Glenn Park is my father. First off I want to say that th'

By default, fastai's `WordTokenizer` uses the spaCy tokenizer for English.

In [10]:
spacy = WordTokenizer()
toks = first(spacy([txt]))
print(coll_repr(toks, 30))

(#361) ['I','am','Anthony','Park',',','Glenn','Park','is','my','father','.','First','off','I','want','to','say','that','the','story','behind','this','movie','and','the','creation','of','the','Amber','Alert'...]


In [11]:
first(spacy(['The U.S. dollar 1.00.']))

(#5) ['The','U.S.','dollar','1.00','.']

fastai's `Tokenizer` then adds extra functionality on top of the word tokenizer.
Here are some of the main special tokens you'll see:

    - xxbos:: Indicates the beginning of a text (here, a review)
    - xxmaj:: Indicates the next word begins with a capital (since we lowercased everything)
    - xxunk:: Indicates the word is unknown

They let us encode additional information for later NLP tasks.

### Example of rules


In [12]:
tkn = Tokenizer(spacy)
print(coll_repr(tkn(txt), 31))

(#393) ['xxbos','i','am','xxmaj','anthony','xxmaj','park',',','xxmaj','glenn','xxmaj','park','is','my','father','.','xxmaj','first','off','i','want','to','say','that','the','story','behind','this','movie','and','the'...]


In [None]:
defaults.text_proc_rules

[<function fastai.text.core.fix_html(x)>,
 <function fastai.text.core.replace_rep(t)>,
 <function fastai.text.core.replace_wrep(t)>,
 <function fastai.text.core.spec_add_spaces(t)>,
 <function fastai.text.core.rm_useless_spaces(t)>,
 <function fastai.text.core.replace_all_caps(t)>,
 <function fastai.text.core.replace_maj(t)>,
 <function fastai.text.core.lowercase(t, add_bos=True, add_eos=False)>]

    - fix_html:: Replaces special HTML characters with a readable version (IMDb reviews have quite a few of these)
    - replace_rep:: Replaces any character repeated three times or more with a special token for repetition (xxrep), the number of times it's repeated, then the character
    - replace_wrep:: Replaces any word repeated three times or more with a special token for word repetition (xxwrep), the number of times it's repeated, then the word
    - spec_add_spaces:: Adds spaces around / and #
    - rm_useless_spaces:: Removes all repetitions of the space character
    - replace_all_caps:: Lowercases a word written in all caps and adds a special token for all caps (xxup) in front of it
    - replace_maj:: Lowercases a capitalized word and adds a special token for capitalized (xxmaj) in front of it
    - lowercase:: Lowercases all text and adds a special token at the beginning (xxbos) and/or the end (xxeos)
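
To make the rule descriptions above more concrete, here is a minimal sketch of two of them written from scratch with the standard library. The names `sketch_replace_rep` and `sketch_replace_all_caps` are hypothetical and the logic is a simplification based only on the descriptions above; fastai's own functions may treat edge cases differently.

```python
import re

# A non-space character followed by two or more copies of itself (3+ in total).
_rep_pattern = re.compile(r"(\S)(\1{2,})")

def sketch_replace_rep(text):
    """Simplified repetition rule: 'www' becomes ' xxrep 3 w '."""
    def _sub(match):
        char, rest = match.groups()
        # Emit the special token, the repeat count, then the character.
        return f" xxrep {len(rest) + 1} {char} "
    return _rep_pattern.sub(_sub, text)

def sketch_replace_all_caps(text):
    """Simplified all-caps rule: lowercase all-caps words and prefix them with xxup."""
    words = []
    for word in text.split():
        if len(word) > 1 and word.isupper():
            words.append("xxup " + word.lower())
        else:
            words.append(word)
    return " ".join(words)

print(sketch_replace_rep("www.fast.ai"))
print(sketch_replace_all_caps("see the INDEX page"))
```

The point of both rules is the same: the vocabulary only needs the lowercase, non-repeated form, while the special token preserves the information that was removed.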

In [13]:
coll_repr(tkn('©   Fast.ai www.fast.ai/INDEX'), 31)

"(#11) ['xxbos','©','xxmaj','fast.ai','xxrep','3','w','.fast.ai','/','xxup','index']"

## Subword Tokenization

Some languages, such as Chinese and Japanese, do not use spaces to separate words.

To handle this, it is better to use subword tokenization, which proceeds in two steps:

- Analyze a corpus of documents to find the most commonly occurring groups of letters. These become the vocab.
- Tokenize the corpus using this vocab of subword units.
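
The two steps above can be sketched with a deliberately naive toy: count frequent character groups, then split words greedily with the resulting pieces. This is not the algorithm used by `SubwordTokenizer` (which relies on SentencePiece); the helper names `most_common_pieces` and `greedy_segment` are hypothetical.

```python
from collections import Counter

def most_common_pieces(corpus, max_len=3, n_pieces=6):
    """Step 1 (toy): count character groups of length 2..max_len inside each word."""
    counts = Counter()
    for word in corpus.split():
        for size in range(2, max_len + 1):
            for start in range(len(word) - size + 1):
                counts[word[start:start + size]] += 1
    return [piece for piece, _ in counts.most_common(n_pieces)]

def greedy_segment(word, pieces):
    """Step 2 (toy): take the longest known piece at each position, else one character."""
    ordered = sorted(pieces, key=len, reverse=True)
    out, i = [], 0
    while i < len(word):
        match = next((p for p in ordered if word.startswith(p, i)), word[i])
        out.append(match)
        i += len(match)
    return out

pieces = most_common_pieces("the then there", n_pieces=3)
print(pieces)
print(greedy_segment("creation", ["cre", "ation", "at"]))
```

Falling back to single characters guarantees that any word can be tokenized, even one that never appeared in the corpus.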


In [14]:
txts = L(o.open().read() for o in files[:2000])


We instantiate our tokenizer, passing in the size of the vocab we want to create, and then we need to "train" it. That is, we need to have it read our documents and find the common sequences of characters to create the vocab. This is done with setup. As we'll see shortly, setup is a special fastai method that is called automatically in our usual data processing pipelines

In [15]:
def subword(sz):
    sp = SubwordTokenizer(vocab_sz=sz)
    sp.setup(txts)
    return ' '.join(first(sp([txt]))[:40])

In [16]:
subword(1000)

'▁I ▁am ▁An th on y ▁P ar k , ▁G le n n ▁P ar k ▁is ▁my ▁father . ▁F ir st ▁off ▁I ▁want ▁to ▁say ▁that ▁the ▁story ▁behind ▁this ▁movie ▁and ▁the ▁creat ion ▁of'


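Picking a subword vocab size is a trade-off. A larger vocab means fewer tokens per sentence, and therefore faster training, less memory, and less state for the model to remember. The downside is larger embedding matrices, which require more data to learn.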
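
As a rough worked example of the embedding cost, the number of weights in an embedding matrix is the vocab size times the embedding width. The width of 400 below is an assumption chosen for illustration, and `embedding_params` is a hypothetical helper.

```python
def embedding_params(vocab_sz, emb_dim):
    """Number of weights in an embedding matrix of shape (vocab_sz, emb_dim)."""
    return vocab_sz * emb_dim

# emb_dim=400 is an assumed width, used only to compare the two vocab sizes.
for vocab_sz in (1000, 10000):
    print(vocab_sz, embedding_params(vocab_sz, emb_dim=400))
```

Under this assumption, growing the vocab tenfold also grows the embedding matrix tenfold, while each individual row is seen less often during training.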

In [17]:
subword(10000)

'▁I ▁am ▁Anthony ▁Park , ▁Glen n ▁Park ▁is ▁my ▁father . ▁First ▁off ▁I ▁want ▁to ▁say ▁that ▁the ▁story ▁behind ▁this ▁movie ▁and ▁the ▁creation ▁of ▁the ▁Am ber ▁Al er t ▁system ▁is ▁a ▁good ▁one .'

## Numericalization

- Make a list of all possible levels of that categorical variable (the vocab).
- Replace each level with its index in the vocab.

In [18]:
toks200 = txts[:200].map(tkn)
toks200[0]

(#393) ['xxbos','i','am','xxmaj','anthony','xxmaj','park',',','xxmaj','glenn'...]

Now we pass the tokens to `setup` to create the vocab.

In [19]:
num = Numericalize()
num.setup(toks200)
coll_repr(num.vocab,20)

"(#1960) ['xxunk','xxpad','xxbos','xxeos','xxfld','xxrep','xxwrep','xxup','xxmaj','the','.',',','a','and','to','of','is','i','it','in'...]"

Our special rules tokens appear first, and then every word appears once, in frequency order. The defaults to Numericalize are min_freq=3,max_vocab=60000. max_vocab=60000 results in fastai replacing all words other than the most common 60,000 with a special unknown word token, xxunk.
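
The effect of `min_freq` and `max_vocab` can be sketched with a plain-Python vocab builder. This is a simplified illustration, not fastai's `Numericalize`; `build_vocab` and `numericalize` are hypothetical helpers, and only two special tokens are included.

```python
from collections import Counter

def build_vocab(token_lists, min_freq=3, max_vocab=60000, specials=("xxunk", "xxpad")):
    """Special tokens first, then tokens in frequency order, filtered by min_freq."""
    counts = Counter(tok for toks in token_lists for tok in toks)
    frequent = [tok for tok, n in counts.most_common(max_vocab)
                if n >= min_freq and tok not in specials]
    return list(specials) + frequent

def numericalize(tokens, vocab):
    """Replace each token with its vocab index; unknown tokens map to xxunk."""
    index = {tok: i for i, tok in enumerate(vocab)}
    unk = index["xxunk"]
    return [index.get(tok, unk) for tok in tokens]

docs = [["the", "movie", "is", "the", "best"], ["the", "movie"]]
vocab = build_vocab(docs, min_freq=2)
print(vocab)
print(numericalize(["the", "best", "movie"], vocab))
```

Here "best" appears only once, so with `min_freq=2` it is left out of the vocab and is numericalized as the index of `xxunk`.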

In [20]:
nums = num(toks)[:20]; nums

TensorText([  0, 201,   0,   0,  11,   0,   0,  16,  71, 310,  10,   0, 153,   0, 161,  14, 139,  21,   9,  89])

In [21]:
# We can map the indices back to tokens through the vocab.
' '.join(num.vocab[o] for o in nums)

'xxunk am xxunk xxunk , xxunk xxunk is my father . xxunk off xxunk want to say that the story'

Many tokens map to `xxunk` here because `toks` holds the raw spaCy tokens (for example `'I'` and `'Anthony'`), while the vocab was built from the lowercased output of `tkn`. Numericalizing `tkn(txt)` instead would avoid most of these unknowns.

## Putting Texts into Batches for a Language Model



In [22]:
#hide_input
stream = "In this chapter, we will go back over the example of classifying movie reviews we studied in chapter 1 and dig deeper under the surface. First we will look at the processing steps necessary to convert text into numbers and how to customize it. By doing this, we'll have another example of the PreProcessor used in the data block API.\nThen we will study how we build a language model and train it for a while."
tokens = tkn(stream)
bs,seq_len = 6,15
d_tokens = np.array([tokens[i*seq_len:(i+1)*seq_len] for i in range(bs)])
df = pd.DataFrame(d_tokens)
display(HTML(df.to_html(index=False,header=None)))

0,1,2,3,4,5,6,7,8,9,10,11,12,13,14
xxbos,xxmaj,in,this,chapter,",",we,will,go,back,over,the,example,of,classifying
movie,reviews,we,studied,in,chapter,1,and,dig,deeper,under,the,surface,.,xxmaj
first,we,will,look,at,the,processing,steps,necessary,to,convert,text,into,numbers,and
how,to,customize,it,.,xxmaj,by,doing,this,",",we,'ll,have,another,example
of,the,preprocessor,used,in,the,data,block,xxup,api,.,\n,xxmaj,then,we
will,study,how,we,build,a,language,model,and,train,it,for,a,while,.


So, we need to divide this array more finely into subarrays of a fixed sequence length. It is important to maintain order within and across these subarrays, because we will use a model that maintains a state so that it remembers what it read previously when predicting what comes next.
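
The splitting described above can be sketched on a list of integers standing in for numericalized tokens. This is a toy version under simplifying assumptions (leftover tokens are dropped, no shuffling), and `lm_batches` is a hypothetical helper rather than part of fastai.

```python
def lm_batches(tokens, bs, seq_len):
    """Split a token stream into bs contiguous rows, then yield chunks of seq_len columns."""
    row_len = len(tokens) // bs  # drop the remainder so every row has the same length
    rows = [tokens[i * row_len:(i + 1) * row_len] for i in range(bs)]
    for start in range(0, row_len, seq_len):
        # One mini-batch: the same column window taken from every row.
        yield [row[start:start + seq_len] for row in rows]

stream = list(range(24))  # stand-in for 24 numericalized tokens
batches = list(lm_batches(stream, bs=2, seq_len=4))
for batch in batches:
    print(batch)
```

Row `i` of each mini-batch continues exactly where row `i` of the previous mini-batch stopped, which is what lets a stateful model carry its memory from one batch to the next.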

In [23]:
#hide_input
bs,seq_len = 6,5
d_tokens = np.array([tokens[i*15:i*15+seq_len] for i in range(bs)])
df = pd.DataFrame(d_tokens)
display(HTML(df.to_html(index=False,header=None)))

0,1,2,3,4
xxbos,xxmaj,in,this,chapter
movie,reviews,we,studied,in
first,we,will,look,at
how,to,customize,it,.
of,the,preprocessor,used,in
will,study,how,we,build


In [24]:
#hide_input
bs,seq_len = 6,5
d_tokens = np.array([tokens[i*15+10:i*15+15] for i in range(bs)])
df = pd.DataFrame(d_tokens)
display(HTML(df.to_html(index=False,header=None)))

0,1,2,3,4
over,the,example,of,classifying
under,the,surface,.,xxmaj
convert,text,into,numbers,and
we,'ll,have,another,example
.,\n,xxmaj,then,we
it,for,a,while,.


## Training a Text Classifier

fastai handles tokenization and numericalization automatically when `TextBlock` is passed to `DataBlock`. All of the arguments that can be passed to `Tokenizer` and `Numericalize` can also be passed to `TextBlock`.


In [25]:

get_imdb = partial(get_text_files, folders=['train', 'test', 'unsup'])

dls_lm = DataBlock(
    blocks=TextBlock.from_folder(path, is_lm=True),
    get_items=get_imdb, splitter=RandomSplitter(0.1)
).dataloaders(path, path=path, bs=128, seq_len=80)



The reason that TextBlock is special is that setting up the numericalizer's vocab can take a long time (we have to read and tokenize every document to get the vocab). To be as efficient as possible it performs a few optimizations:

- It saves the tokenized documents in a temporary folder, so it doesn't have to tokenize them more than once
- It runs multiple tokenization processes in parallel, to take advantage of your computer's CPUs

In [26]:

dls_lm.show_batch(max_n=2)

Unnamed: 0,text,text_
0,"xxbos xxmaj if it was n't for some immature gullible idiot i know insisting that i watch this "" documentary "" i would never have seen this comedy ! xxmaj this film is full of bad scripting and laughable moments . xxmaj one in particular is where the xxmaj afghan police / soldiers arrest xxmaj don xxmaj larson for filming in the streets while they allow the cameraman to carry on filming his arrest and then drive away , still","xxmaj if it was n't for some immature gullible idiot i know insisting that i watch this "" documentary "" i would never have seen this comedy ! xxmaj this film is full of bad scripting and laughable moments . xxmaj one in particular is where the xxmaj afghan police / soldiers arrest xxmaj don xxmaj larson for filming in the streets while they allow the cameraman to carry on filming his arrest and then drive away , still filming"
1,"fine mountainous shots as men give chase on horseback and such . xxmaj do n't expect to get your socks blown off , but the film is simple and well paced . xxbos i recently viewed xxmaj manufactured xxmaj landscapes at the xxmaj seattle xxmaj international xxmaj film xxmaj festival . i was drawn to the movie as a photographer because xxmaj i 'm both familiar and a fan of xxmaj burtynsky 's work . xxmaj while i believe the","mountainous shots as men give chase on horseback and such . xxmaj do n't expect to get your socks blown off , but the film is simple and well paced . xxbos i recently viewed xxmaj manufactured xxmaj landscapes at the xxmaj seattle xxmaj international xxmaj film xxmaj festival . i was drawn to the movie as a photographer because xxmaj i 'm both familiar and a fan of xxmaj burtynsky 's work . xxmaj while i believe the movie"


## Fine-Tuning the Language Model

To convert the integer word indices into activations for the neural network, we use embeddings. We then feed those embeddings into a recurrent neural network (RNN), using an architecture called AWD-LSTM.

In [27]:
learn = language_model_learner(
    dls_lm, AWD_LSTM, drop_mult=0.3,
    metrics=[accuracy, Perplexity()]).to_fp16()

- The loss function used by default is cross-entropy loss, since we essentially have a classification problem (the different categories being the words in our vocab).

- The perplexity metric used here is often used in NLP for language models: it is the exponential of the loss (i.e., `torch.exp(cross_entropy)`).

- We also include the accuracy metric, to see how many times our model is right when trying to predict the next word, since cross-entropy (as we've seen) is both hard to interpret, and tells us more about the model's confidence than its accuracy.
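
The relationship between cross-entropy and perplexity can be checked with a small worked example using only the standard library. The probabilities are made up for illustration, and `cross_entropy` and `perplexity` are hypothetical helpers, not the fastai metrics.

```python
import math

def cross_entropy(probs_of_correct_token):
    """Mean negative log-probability the model assigned to each correct next token."""
    return -sum(math.log(p) for p in probs_of_correct_token) / len(probs_of_correct_token)

def perplexity(loss):
    """Perplexity is the exponential of the cross-entropy loss."""
    return math.exp(loss)

# A model that always spreads its probability evenly over 4 candidate tokens.
uniform_guess = [1 / 4] * 10
loss = cross_entropy(uniform_guess)
print(loss, perplexity(loss))
```

A perplexity close to 4 here means the model is about as uncertain as if it were choosing uniformly among four tokens, which is often easier to reason about than the raw loss value.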

In [None]:
learn.fit_one_cycle(1, 2e-2)



epoch,train_loss,valid_loss,accuracy,perplexity,time


