<a href="https://colab.research.google.com/github/IphixLi/Fastai-deep-learning/blob/main/NLP_fastai.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
#hide
! [ -e /content ] && pip install -Uqq fastbook
import fastbook
fastbook.setup_book()

Mounted at /content/gdrive


In [2]:
#hide
from fastbook import *
from IPython.display import display,HTML

## NLP Deep Dive

We can use language models for many NLP tasks, such as autocomplete and grammar checking. To do these well, however, we need to fine-tune the pretrained model on the target domain. For example, it makes sense to train on a Wikipedia corpus for a domain that uses Wikipedia data, so that the model learns that domain's vocabulary and semantics.

- Universal Language Model Fine-tuning (ULMFiT) approach: first train a language model on a large body of text, then fine-tune it on the target corpus.


Each of the steps necessary to create a language model has jargon associated with it from the world of natural language processing, and fastai and PyTorch classes available to help. The steps are:

- Tokenization:: Convert the text into a list of words (or characters, or substrings, depending on the granularity of your model)
    
    - Word-based: split a sentence on spaces and apply language-specific rules, for example separating punctuation marks into individual tokens
    - Subword-based: split words into smaller parts
    - Character-based: split a sentence into its individual characters

- Numericalization:: Make a list of all of the unique words that appear (the vocab), and convert each word into a number, by looking up its index in the vocab
- Language model data loader creation:: fastai provides an LMDataLoader class which automatically handles creating a dependent variable that is offset from the independent variable by one token. It also handles some important details, such as how to shuffle the training data in such a way that the dependent and independent variables maintain their structure as required
- Language model creation:: We need a special kind of model that does something we haven't seen before: handles input lists which could be arbitrarily big or small.
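The first three steps above can be sketched in plain Python. This is a minimal illustration, not fastai's implementation: the helper names (`tokenize`, `build_vocab`, `numericalize`) and the toy sentence are assumptions made for this example, and a whitespace split stands in for a real tokenizer.

```python
# Toy corpus (an assumption for illustration); the notebook itself uses IMDb reviews.
text = "the movie was good and the story was good"

def tokenize(s):
    # Simplest possible word-based tokenizer: split on whitespace.
    return s.split()

def build_vocab(tokens):
    # Unique tokens in order of first appearance; the list index is the token id.
    vocab = []
    for tok in tokens:
        if tok not in vocab:
            vocab.append(tok)
    return vocab

def numericalize(tokens, vocab):
    # Replace each token by its index in the vocab.
    index = {tok: i for i, tok in enumerate(vocab)}
    return [index[tok] for tok in tokens]

tokens = tokenize(text)
vocab = build_vocab(tokens)
ids = numericalize(tokens, vocab)

# Language-model targets: the dependent variable is the input shifted by one token.
x, y = ids[:-1], ids[1:]

print(vocab)  # ['the', 'movie', 'was', 'good', 'and', 'story']
print(ids)    # [0, 1, 2, 3, 4, 0, 5, 2, 3]
print(x)      # [0, 1, 2, 3, 4, 0, 5, 2]
print(y)      # [1, 2, 3, 4, 0, 5, 2, 3]
```

Each element of `y` is the token that follows the corresponding element of `x`, which is exactly what a language model is trained to predict.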




In [3]:
# Tokenization, using the IMDb data

from fastai.text.all import *
path = untar_data(URLs.IMDB)


In [4]:

files = get_text_files(path, folders = ['train', 'test', 'unsup'])

In [11]:
# Read the first review; this is the text that will be tokenized

txt = files[0].open().read()
txt[:75]

'I am Anthony Park, Glenn Park is my father. First off I want to say that th'

By default, fastai uses the spaCy tokenizer for English.

In [9]:
spacy = WordTokenizer()
toks = first(spacy([txt]))
print(coll_repr(toks, 30))

(#361) ['I','am','Anthony','Park',',','Glenn','Park','is','my','father','.','First','off','I','want','to','say','that','the','story','behind','this','movie','and','the','creation','of','the','Amber','Alert'...]


In [12]:
first(spacy(['The U.S. dollar 1.00.']))

(#5) ['The','U.S.','dollar','1.00','.']
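To see why a dedicated word tokenizer is useful, compare two naive approaches on the same sentence. This is an illustrative sketch using only the Python standard library; neither approach is what spaCy or fastai does internally.

```python
import re

sentence = "The U.S. dollar 1.00."

# Splitting on whitespace leaves the final period glued to the number.
by_space = sentence.split()
print(by_space)  # ['The', 'U.S.', 'dollar', '1.00.']

# Splitting off every punctuation mark breaks the abbreviation and the decimal.
by_punct = re.findall(r"\w+|[^\w\s]", sentence)
print(by_punct)  # ['The', 'U', '.', 'S', '.', 'dollar', '1', '.', '00', '.']
```

The spaCy output above keeps 'U.S.' and '1.00' intact while still separating the sentence-final period, which is the kind of language-specific rule that a simple split cannot express.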

fastai then adds functionality on top of the word tokenizer through its Tokenizer class.
Here are some of the main special tokens you'll see:

    - xxbos:: Indicates the beginning of a text (here, a review)
    - xxmaj:: Indicates the next word begins with a capital (since we lowercased everything)
    - xxunk:: Indicates the word is unknown

These tokens encode additional information that is useful for downstream NLP tasks.

### Example of rules


In [18]:
tkn = Tokenizer(spacy)
print(coll_repr(tkn(txt), 31))

(#393) ['xxbos','i','am','xxmaj','anthony','xxmaj','park',',','xxmaj','glenn','xxmaj','park','is','my','father','.','xxmaj','first','off','i','want','to','say','that','the','story','behind','this','movie','and','the'...]
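The effect of the special tokens in the output above can be approximated with a short sketch. `add_special_tokens` is a hypothetical helper written for this note, not fastai's Tokenizer. It assumes that single-letter words such as "I" are lowercased without a marker, which matches the output shown. Note also that in fastai xxunk is introduced later, when tokens are numericalized against a vocab; it is folded in here only to show what the token means.

```python
def add_special_tokens(tokens, vocab=None):
    # Hypothetical helper: mimics the *effect* of xxbos, xxmaj, and xxunk
    # on a list of word tokens. It is not fastai's implementation.
    out = ["xxbos"]  # mark the beginning of the text
    for tok in tokens:
        if len(tok) > 1 and tok[0].isupper() and tok[1:].islower():
            out.append("xxmaj")  # the next word started with a capital letter
        low = tok.lower()
        if vocab is not None and low not in vocab:
            low = "xxunk"  # the word is not in the vocab
        out.append(low)
    return out

words = ["I", "am", "Anthony", "Park", ",", "Glenn", "Park", "is", "my", "father", "."]
marked = add_special_tokens(words)
print(marked)
# ['xxbos', 'i', 'am', 'xxmaj', 'anthony', 'xxmaj', 'park', ',', 'xxmaj', 'glenn',
#  'xxmaj', 'park', 'is', 'my', 'father', '.']

# With a deliberately tiny vocab, the unseen name becomes xxunk.
small_vocab = {"i", "am", "is", "my", "father", ",", "."}
with_unk = add_special_tokens(["I", "am", "Glenn"], small_vocab)
print(with_unk)  # ['xxbos', 'i', 'am', 'xxmaj', 'xxunk']
```

Because capitalization is moved into a separate token, "Park" and "park" share one vocab entry while the model can still see that the original word was capitalized.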


In [15]:
defaults.text_proc_rules

[<function fastai.text.core.fix_html(x)>,
 <function fastai.text.core.replace_rep(t)>,
 <function fastai.text.core.replace_wrep(t)>,
 <function fastai.text.core.spec_add_spaces(t)>,
 <function fastai.text.core.rm_useless_spaces(t)>,
 <function fastai.text.core.replace_all_caps(t)>,
 <function fastai.text.core.replace_maj(t)>,
 <function fastai.text.core.lowercase(t, add_bos=True, add_eos=False)>]

    - fix_html:: Replaces special HTML characters with a readable version (IMDb reviews have quite a few of these)
    - replace_rep:: Replaces any character repeated three times or more with a special token for repetition (xxrep), the number of times it's repeated, then the character
    - replace_wrep:: Replaces any word repeated three times or more with a special token for word repetition (xxwrep), the number of times it's repeated, then the word
    - spec_add_spaces:: Adds spaces around / and #
    - rm_useless_spaces:: Removes all repetitions of the space character
    - replace_all_caps:: Lowercases a word written in all caps and adds a special token for all caps (xxup) in front of it
    - replace_maj:: Lowercases a capitalized word and adds a special token for capitalized (xxmaj) in front of it
    - lowercase:: Lowercases all text and adds a special token at the beginning (xxbos) and/or the end (xxeos)
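Four of the rules described above can be approximated with regular expressions. The functions below are simplified re-implementations written for this note (hence the `_sketch` suffix); the patterns are assumptions, and fastai's own versions in fastai.text.core differ in detail.

```python
import re

def replace_rep_sketch(text):
    # A character repeated 3+ times -> " xxrep <count> <char> "
    def _sub(m):
        return f" xxrep {len(m.group(0))} {m.group(1)} "
    return re.sub(r"(\S)\1{2,}", _sub, text)

def replace_wrep_sketch(text):
    # A word repeated 3+ times -> " xxwrep <count> <word> "
    def _sub(m):
        count = len(m.group(0).split())
        return f" xxwrep {count} {m.group(1)} "
    return re.sub(r"\b(\w+)(?:\s+\1\b){2,}", _sub, text)

def spec_add_spaces_sketch(text):
    # Put spaces around / and #
    return re.sub(r"([/#])", r" \1 ", text)

def rm_useless_spaces_sketch(text):
    # Collapse runs of spaces into a single space
    return re.sub(r" {2,}", " ", text)

def apply_rules(text, rules):
    # Apply each rule in order, as a list of text-to-text functions
    for rule in rules:
        text = rule(text)
    return text.strip()

rules = [replace_rep_sketch, replace_wrep_sketch,
         spec_add_spaces_sketch, rm_useless_spaces_sketch]
cleaned = apply_rules("cooool   movie, very very very good 10/10", rules)
print(cleaned)  # c xxrep 4 o l movie, xxwrep 3 very good 10 / 10
```

Writing each rule as a plain function from string to string is what makes the rule list easy to reorder or extend.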

In [21]:
coll_repr(tkn('©   Fast.ai www.fast.ai/INDEX'), 31)

"(#11) ['xxbos','©','xxmaj','fast.ai','xxrep','3','w','.fast.ai','/','xxup','index']"

## Subword Tokenization

Some languages, such as Chinese and Japanese, do not use spaces to separate words.

To handle this, it is better to use subword tokenization, which proceeds in two steps:

- Analyze a corpus of documents to find the most commonly occurring groups of letters. These become the vocab.
- Tokenize the corpus using this vocab of subword units.
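The two steps above can be illustrated with a tiny byte-pair-encoding-style merge procedure, which is one common way to find frequently occurring groups of letters. This is a swapped-in technique chosen for brevity, not necessarily the algorithm fastai uses (its subword tokenizer delegates to the SentencePiece library). The function names and the toy word list are assumptions.

```python
from collections import Counter

def merge_pair(symbols, pair):
    # Join every adjacent occurrence of `pair` into a single symbol.
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def learn_merges(words, n_merges):
    # Step 1: analyze the corpus to find the most common groups of letters.
    corpus = Counter(tuple(w) for w in words)  # each word starts as characters
    merges = []
    for _ in range(n_merges):
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        # Most frequent pair; ties are broken by the pair itself for determinism.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        merged = Counter()
        for symbols, freq in corpus.items():
            merged[merge_pair(symbols, best)] += freq
        corpus = merged
    return merges

def subword_tokenize(word, merges):
    # Step 2: tokenize using the learned subword units, in the order learned.
    symbols = tuple(word)
    for pair in merges:
        symbols = merge_pair(symbols, pair)
    return list(symbols)

merges = learn_merges(["low", "low", "low", "lower", "lowest"], n_merges=2)
print(merges)                              # [('o', 'w'), ('l', 'ow')]
print(subword_tokenize("lowest", merges))  # ['low', 'e', 's', 't']
print(subword_tokenize("slow", merges))    # ['s', 'low']
```

The last line shows the main benefit: a word that never appeared in the corpus is still split into known pieces rather than becoming an unknown token.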


In [22]:
txts = L(o.open().read() for o in files[:2000])